| Message ID | 20250327145159.99799-1-alexei.starovoitov@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | [GIT,PULL] Introduce try_alloc_pages for 6.15 |
On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> The pull includes work from Sebastian, Vlastimil and myself
> with a lot of help from Michal and Shakeel.
> This is a first step towards making kmalloc reentrant to get rid
> of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> These patches make page allocator safe from any context.

So I've pulled this too, since it looked generally fine.

The one reaction I had is that when you basically change

        spin_lock_irqsave(&zone->lock, flags);

into

        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);
        }

we've seen bad cache behavior for this kind of pattern in other
situations: if the "try" fails, the subsequent "do the lock for real"
case now does the wrong thing, in that it will immediately try again
even if it's almost certainly just going to fail - causing extra
write cache accesses.

So typically, in places that can see contention, it's better to
either do

 (a) trylock followed by a slowpath that takes the fact that it was
     locked into account and does a read-only loop until it sees
     otherwise

     This is, for example, what the mutex code does with that
     __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
     spinlocks end up doing similar things (ie "trylock" followed by
     "release irq and do the 'relax loop'" thing).

or

 (b) do the trylock and lock separately, ie

        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else
                spin_lock_irqsave(&zone->lock, flags);

so that you don't end up doing two cache accesses for ownership that
can cause extra bouncing.

I'm not sure this matters at all in the allocation path - contention
may simply not be enough of an issue, and the trylock is purely about
"unlikely NMI worries", but I do worry that you might have made the
normal case slower.

It's easily fixable later if it ends up being the case, so I don't
worry too much about it, but I did want to mention it since going
through the code made me react to it.

               Linus
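A minimal, self-contained sketch of pattern (a) may help here: it uses
C11 atomics rather than the kernel's real spinlock/mutex code, and the
toy_* names are invented for the illustration. The shape is what
matters: a single atomic RMW on the uncontended fast path, and a
read-only spin on the contended slowpath that retries the RMW only
once the lock looks free.

  /*
   * Toy illustration of pattern (a), NOT kernel code.
   * Fast path: one atomic RMW.  Slowpath: read-only spin, then retry
   * the RMW only after the lock has been observed free.
   */
  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_int locked; } toy_lock_t;

  bool toy_trylock(toy_lock_t *l)
  {
          int expected = 0;
          /* The only write-side cache access on the uncontended path. */
          return atomic_compare_exchange_strong_explicit(
                          &l->locked, &expected, 1,
                          memory_order_acquire, memory_order_relaxed);
  }

  void toy_lock(toy_lock_t *l)
  {
          if (toy_trylock(l))
                  return;         /* uncontended: done after one RMW */

          for (;;) {
                  /* Read-only spin keeps the cacheline shared instead of bouncing it. */
                  while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                          ;
                  if (toy_trylock(l))     /* retry the RMW only when it can succeed */
                          return;
          }
  }

  void toy_unlock(toy_lock_t *l)
  {
          atomic_store_explicit(&l->locked, 0, memory_order_release);
  }

By contrast, the combined trylock-then-lock form quoted above always
pays a failed RMW before falling into the generic lock path on
contention, which is the extra write-side cache access Linus is
pointing at.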
On Sun, 30 Mar 2025 at 13:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The one reaction I had is that when you basically change

Oh, actually, two reactions now that I fixed up the merge build issue
which forced me to look at the function naming.

That 'localtry_lock_irqsave()' naming is horrendous.

The "try" part in it makes me think it's a trylock.

But no, the "localtry" just comes from the lock naming, and the
trylock version is localtry_trylock_irqsave.

That's horrible. I missed that on the first read-through, and I'm not
unpulling it because the code generally makes sense. But I do think
that the lock name needs fixing.

"localtry_lock_t" is not a good name, and spreading that odd
"localtry" into the actual (non-try) locking functions makes the
naming actively insane.

If the *only* operation you could do on the lock was "trylock", then
"localtry" would be fine. Then the lock literally is an "only try"
thing. But as it is, the naming now ends up actively broken.

Honestly, the lock name should probably reflect the fact that it can
be used from any context (with a "trylock"), not the trylock part
itself.

So maybe "nmisafe_local_lock_t" or something in that vein?

Please fix this up. There aren't *that* many users of
"localtry_xyzzy", let's get this fixed before there are more of them.

               Linus
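To see why the "localtry" prefix spreading into the non-try operations
reads oddly, here is a rough sketch of the API shape under discussion.
Only the localtry_lock_irqsave()/localtry_trylock_irqsave() names come
from the thread itself; the unlock name and the (lock, flags) calling
convention are assumed by analogy with the existing local_lock API,
and the per-CPU cache structure is entirely made up.

  /*
   * Rough sketch of the localtry API shape (names from the thread,
   * everything else illustrative).  The same lock type is used both
   * by ordinary code via the non-try acquire and by any-context code
   * via the trylock -- which is what makes the "try" in the type name
   * misleading.
   */
  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct pcpu_cache_sketch {
          localtry_lock_t lock;
          unsigned int cached_objs;
  };

  static DEFINE_PER_CPU(struct pcpu_cache_sketch, pcpu_cache_sketch);

  /* Process/softirq context: plain (non-try) acquire. */
  static void refill_cache_sketch(unsigned int n)
  {
          unsigned long flags;

          localtry_lock_irqsave(&pcpu_cache_sketch.lock, flags);
          __this_cpu_add(pcpu_cache_sketch.cached_objs, n);
          localtry_unlock_irqrestore(&pcpu_cache_sketch.lock, flags);
  }

  /* Any context (including NMI): only the trylock form is safe. */
  static bool try_consume_cache_sketch(void)
  {
          unsigned long flags;

          if (!localtry_trylock_irqsave(&pcpu_cache_sketch.lock, flags))
                  return false;   /* caller must handle the failure */
          if (__this_cpu_read(pcpu_cache_sketch.cached_objs))
                  __this_cpu_dec(pcpu_cache_sketch.cached_objs);
          localtry_unlock_irqrestore(&pcpu_cache_sketch.lock, flags);
          return true;
  }

The same lock type serves both the plain acquire in known-safe context
and the trylock from arbitrary context, which is exactly why a name
describing the context guarantee fits better than one describing the
"try" operation.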
The pull request you sent on Thu, 27 Mar 2025 10:51:59 -0400:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/aa918db707fba507e85217961643281ee8dfb2ed
Thank you!
On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > The pull includes work from Sebastian, Vlastimil and myself
> > with a lot of help from Michal and Shakeel.
> > This is a first step towards making kmalloc reentrant to get rid
> > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > These patches make page allocator safe from any context.
>
> So I've pulled this too, since it looked generally fine.

Thanks!

> The one reaction I had is that when you basically change
>
>         spin_lock_irqsave(&zone->lock, flags);
>
> into
>
>         if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                         return NULL;
>                 spin_lock_irqsave(&zone->lock, flags);
>         }
>
> we've seen bad cache behavior for this kind of pattern in other
> situations: if the "try" fails, the subsequent "do the lock for real"
> case now does the wrong thing, in that it will immediately try again
> even if it's almost certainly just going to fail - causing extra
> write cache accesses.
>
> So typically, in places that can see contention, it's better to
> either do
>
>  (a) trylock followed by a slowpath that takes the fact that it was
>      locked into account and does a read-only loop until it sees
>      otherwise
>
>      This is, for example, what the mutex code does with that
>      __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
>      spinlocks end up doing similar things (ie "trylock" followed by
>      "release irq and do the 'relax loop'" thing).

Right, the __mutex_trylock(lock) -> mutex_optimistic_spin() pattern
is equivalent to the 'pending' bit spinning in qspinlock.

> or
>
>  (b) do the trylock and lock separately, ie
>
>         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
>                 if (!spin_trylock_irqsave(&zone->lock, flags))
>                         return NULL;
>         } else
>                 spin_lock_irqsave(&zone->lock, flags);
>
> so that you don't end up doing two cache accesses for ownership that
> can cause extra bouncing.

Ok, I will switch to the above.

> I'm not sure this matters at all in the allocation path - contention
> may simply not be enough of an issue, and the trylock is purely about
> "unlikely NMI worries", but I do worry that you might have made the
> normal case slower.

We actually did see zone->lock being contended in production.
Last time the culprit was inadequate per-cpu caching and this series
in 6.11 fixed it:
https://lwn.net/Articles/947900/

I don't think we've seen it contended in the newer kernels.
Johannes, pls correct me if I'm wrong.

But to avoid any finger pointing, I'll switch to checking alloc_flags
first. It does seem a better trade-off to avoid cache bouncing because
of the 2nd cmpxchg, though when I wrote it this way I convinced myself
and others that it's faster to do the trylock first to avoid branch
misprediction.
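For completeness, the shape being agreed to here would look roughly
like the condensed sketch below, as it might sit inside
mm/page_alloc.c. The function name is invented and the body is heavily
simplified; only ALLOC_TRYLOCK and the spin_lock/spin_trylock calls
reflect the actual change under discussion.

  /*
   * Condensed sketch of variant (b) applied to the zone-lock path.
   * Not the real rmqueue code -- just the locking shape: the
   * opportunistic caller issues one trylock and bails out, while the
   * blocking caller never runs a doomed trylock first.
   */
  static struct page *rmqueue_locked_sketch(struct zone *zone, unsigned int order,
                                            int migratetype, unsigned int alloc_flags)
  {
          struct page *page;
          unsigned long flags;

          if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                  /* Any-context caller: one trylock, tolerate failure. */
                  if (!spin_trylock_irqsave(&zone->lock, flags))
                          return NULL;
          } else {
                  /* Normal caller: no extra RMW before the real lock. */
                  spin_lock_irqsave(&zone->lock, flags);
          }

          /* The real code does the __rmqueue() work here, with retries etc. */
          page = __rmqueue(zone, order, migratetype, alloc_flags);

          spin_unlock_irqrestore(&zone->lock, flags);
          return page;
  }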
On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I do think that the lock name needs fixing.
>
> "localtry_lock_t" is not a good name, and spreading that odd
> "localtry" into the actual (non-try) locking functions makes the
> naming actively insane.
>
> If the *only* operation you could do on the lock was "trylock", then
> "localtry" would be fine. Then the lock literally is an "only try"
> thing. But as it is, the naming now ends up actively broken.
>
> Honestly, the lock name should probably reflect the fact that it can
> be used from any context (with a "trylock"), not the trylock part
> itself.
>
> So maybe "nmisafe_local_lock_t" or something in that vein?
>
> Please fix this up. There aren't *that* many users of
> "localtry_xyzzy", let's get this fixed before there are more of them.

Ok. Agree with the reasoning that the name doesn't quite fit.
The nmisafe_local_lock_t name works for me, though
nmisafe_local_lock_irqsave() is a bit verbose.
I don't have better name suggestions at the moment.

Sebastian, Vlastimil, what do you prefer?
On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> But to avoid any finger pointing, I'll switch to checking alloc_flags
> first. It does seem a better trade-off to avoid cache bouncing because
> of the 2nd cmpxchg, though when I wrote it this way I convinced myself
> and others that it's faster to do the trylock first to avoid branch
> misprediction.

Yes, the really hot paths (ie core locking) do the "trylock -> read
spinning" for that reason. Then for the normal case, _only_ the
trylock is in the path, and that's the best of both worlds.

And in practice, doing two compare-and-exchange operations actually
does work fine, because the cacheline will generally be sticky enough
that you don't actually get much extra cacheline bouncing.

So I'm not sure it matters in the end, but I did react to it.

               Linus
Hi Linus,

The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

for you to fetch changes up to f90b474a35744b5d43009e4fab232e74a3024cae:

  mm: Fix the flipped condition in gfpflags_allow_spinning() (2025-03-15 11:18:19 -0700)

----------------------------------------------------------------
Please pull after the main MM changes.

The pull includes work from Sebastian, Vlastimil and myself,
with a lot of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant, to get rid
of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
These patches make the page allocator safe from any context.

Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/

Why:
SLAB wrappers bind memory to a particular subsystem, making it
unavailable to the rest of the kernel. Some BPF maps in production
consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G,
1.1G, 300M, 200M. Once we have a kmalloc that works in any context,
BPF map preallocation won't be necessary.

How:
The synchronous kmalloc/page-alloc stack has multiple stages going
from fast to slow: cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages
-> rmqueue_pcplist -> __rmqueue. rmqueue_pcplist was already relying
on trylock. This set changes rmqueue_bulk/rmqueue_buddy to attempt a
trylock and return ENOMEM if alloc_flags & ALLOC_TRYLOCK, then wraps
this functionality into the try_alloc_pages() helper. We make sure
that the logic is sane on PREEMPT_RT.

End result: try_alloc_pages()/free_pages_nolock() are safe to call
from any context (see the usage sketch after the diffstat below).

try_kmalloc() for any context, with a similar trylock approach, will
follow. It will use try_alloc_pages() when slab needs a new page.
Though such a try_kmalloc()/page_alloc() is an opportunistic
allocator, this design ensures that the probability of successful
allocation of small objects (up to one page in size) is high.

Even before we have try_kmalloc(), we already use try_alloc_pages()
in the BPF arena implementation, and it's going to be used more
extensively in BPF.

Once the set was applied to bpf-next we ran into two small conflicts
with the MM tree, as reported by Stephen:
https://lore.kernel.org/bpf/20250311120422.1d9a8f80@canb.auug.org.au/
https://lore.kernel.org/bpf/20250312145247.380c2aa5@canb.auug.org.au/

So Andrew suggested keeping things as-is instead of moving the
patchset between the trees before the merge window:
https://lore.kernel.org/all/20250317132710.fbcde1c8bb66f91f36e78c89@linux-foundation.org/

Note: the "locking/local_lock: Introduce localtry_lock_t" patch is
later used in Vlastimil's sheaves and in Shakeel's changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

----------------------------------------------------------------
Alexei Starovoitov (6):
      mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
      mm, bpf: Introduce free_pages_nolock()
      memcg: Use trylock to access memcg stock_lock.
      mm, bpf: Use memcg in try_alloc_pages().
      bpf: Use try_alloc_pages() to allocate pages for bpf needs.
      Merge branch 'bpf-mm-introduce-try_alloc_pages'

Sebastian Andrzej Siewior (1):
      locking/local_lock: Introduce localtry_lock_t

Vlastimil Babka (1):
      mm: Fix the flipped condition in gfpflags_allow_spinning()

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |  70 +++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++++++
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  53 +++++---
 mm/page_alloc.c                     | 203 +++++++++++++++++++++++++++++---
 mm/page_owner.c                     |   8 +-
 13 files changed, 509 insertions(+), 42 deletions(-)
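To make the "safe to call from any context" end result concrete, here
is a minimal, hypothetical caller of the two new helpers. The grab_/
release_ functions and the scenario are invented for illustration; the
helper signatures are assumed to be the ones this series adds, i.e.
struct page *try_alloc_pages(int nid, unsigned int order) and
void free_pages_nolock(struct page *page, unsigned int order).

  /*
   * Hypothetical caller of the new any-context page allocation
   * helpers.  The grab_/release_ functions are made up; the
   * try_alloc_pages()/free_pages_nolock() signatures are assumed to
   * match the ones introduced by this series.
   */
  #include <linux/gfp.h>
  #include <linux/mm.h>

  void *grab_scratch_page(void)
  {
          struct page *page;

          /*
           * Opportunistic: may be called from NMI/tracing/BPF context.
           * On contention or low memory it returns NULL instead of
           * spinning or sleeping, so the caller must tolerate failure.
           */
          page = try_alloc_pages(NUMA_NO_NODE, 0);
          if (!page)
                  return NULL;

          return page_address(page);
  }

  void release_scratch_page(void *addr)
  {
          /* The matching free helper, likewise safe from any context. */
          if (addr)
                  free_pages_nolock(virt_to_page(addr), 0);
  }

Failure is an expected outcome here rather than an error condition,
which matches how the cover letter describes the BPF arena usage and
the planned try_kmalloc().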