[bpf-next] bpf: Add a retry after refilling the free list when unit_alloc() fails

Message ID 20250212084851.150169-1-changwoo@igalia.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf-next] bpf: Add a retry after refilling the free list when unit_alloc() fails

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: Missing a blank line after declarations
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / GCC BPF
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-11 success Logs for aarch64-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-12 success Logs for aarch64-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / GCC BPF
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-18 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for s390x-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-21 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-20 success Logs for s390x-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-17 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-17 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-44 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-43 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-50 success Logs for x86_64-llvm-18 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-51 success Logs for x86_64-llvm-18 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / GCC BPF / GCC BPF
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-gcc / veristat-kernel / x86_64-gcc veristat_kernel
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / GCC BPF / GCC BPF
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-gcc / veristat-meta / x86_64-gcc veristat_meta
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / GCC BPF / GCC BPF
bpf/vmtest-bpf-next-VM_Test-45 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-49 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-46 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-47 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-48 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18

Commit Message

Changwoo Min Feb. 12, 2025, 8:48 a.m. UTC
When there is no entry in the free list (c->free_llist), unit_alloc()
fails even when there is available memory in the system, causing allocation
failure in various BPF calls -- such as bpf_mem_alloc() and
bpf_cpumask_create().

Such an allocation failure can happen, especially when a BPF program makes
many allocations -- more than the delta between the high and low watermarks --
in an IRQ-disabled context.

To address the problem, when there is no free entry, refill the free list
with a single entry (alloc_bulk) and then retry the allocation procedure on
the free list. Note that since some callers of unit_alloc() are not allowed
to block (e.g., bpf_cpumask_create), allocate the additional free entry in an
atomic manner (atomic = true in alloc_bulk).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/bpf/memalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)
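
For context, the pre-patch unit_alloc() only pops an object from the per-CPU
free list and defers any refilling to an irq_work that is raised once the list
drops below the low watermark; it never allocates synchronously. A simplified
sketch of that fast path (comments and housekeeping from kernel/bpf/memalloc.c
elided) looks roughly like this:

	static void notrace *unit_alloc(struct bpf_mem_cache *c)
	{
		struct llist_node *llnode = NULL;
		unsigned long flags;
		int cnt = 0;

		local_irq_save(flags);
		if (local_inc_return(&c->active) == 1) {
			/* Pop one pre-allocated object from the per-CPU free list. */
			llnode = __llist_del_first(&c->free_llist);
			if (llnode)
				cnt = --c->free_cnt;
		}
		local_dec(&c->active);

		/* Refilling is deferred to an irq_work; nothing is allocated
		 * here, so an empty free list means the caller simply gets NULL.
		 */
		if (cnt < c->low_watermark)
			irq_work_raise(c);
		local_irq_restore(flags);

		return llnode;
	}

The patch below adds a one-entry synchronous refill and a single retry when
this pop fails.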

Comments

Song Liu Feb. 12, 2025, 6:33 p.m. UTC | #1
On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> When there is no entry in the free list (c->free_llist), unit_alloc()
> fails even when there is available memory in the system, causing allocation
> failure in various BPF calls -- such as bpf_mem_alloc() and
> bpf_cpumask_create().
>
> Such allocation failure can happen, especially when a BPF program tries many
> allocations -- more than a delta between high and low watermarks -- in an
> IRQ-disabled context.

Can we add a selftest for this scenario?

>
> To address the problem, when there is no free entry, refill one entry on the
> free list (alloc_bulk) and then retry the allocation procedure on the free
> list. Note that since some callers of unit_alloc() do not allow to block
> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
> manner (atomic = true in alloc_bulk).
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/bpf/memalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 889374722d0a..22fe9cfb2b56 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -784,6 +784,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>         struct llist_node *llnode = NULL;
>         unsigned long flags;
>         int cnt = 0;
> +       bool retry = false;

"retry = false;" reads weird to me. Maybe rename it as "retried"?

>
>         /* Disable irqs to prevent the following race for majority of prog types:
>          * prog_A
> @@ -795,6 +796,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>          * Use per-cpu 'active' counter to order free_list access between
>          * unit_alloc/unit_free/bpf_mem_refill.
>          */
> +retry_alloc:
>         local_irq_save(flags);
>         if (local_inc_return(&c->active) == 1) {
>                 llnode = __llist_del_first(&c->free_llist);
> @@ -815,6 +817,13 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>          */
>         local_irq_restore(flags);
>
> +       if (unlikely(!llnode && !retry)) {
> +               int cpu = smp_processor_id();
> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);
cpu_to_node() is not necessary, we can just do

alloc_bulk(c, 1, NUMA_NO_NODE, true);

Also, maybe we can let alloc_bulk return int (0 or -ENOMEM).
For -ENOMEM, there is no need to goto retry_alloc.

Does this make sense?

Thanks,
Song

> +               retry = true;
> +               goto retry_alloc;
> +       }
> +
>         return llnode;
>  }
>
> --
> 2.48.1
>
Changwoo Min Feb. 13, 2025, 8:41 a.m. UTC | #2
Hello Song,

Thank you for the review!

On 25. 2. 13. 03:33, Song Liu wrote:
> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>>
>> When there is no entry in the free list (c->free_llist), unit_alloc()
>> fails even when there is available memory in the system, causing allocation
>> failure in various BPF calls -- such as bpf_mem_alloc() and
>> bpf_cpumask_create().
>>
>> Such allocation failure can happen, especially when a BPF program tries many
>> allocations -- more than a delta between high and low watermarks -- in an
>> IRQ-disabled context.
> 
> Can we add a selftests for this scenario?

It would be a bit tricky to create an IRQ-disabled case in a BPF
program. However, I think it will be possible to reproduce the
allocation failure issue by making a sufficiently large number of
small allocations.

> 
>>
>> To address the problem, when there is no free entry, refill one entry on the
>> free list (alloc_bulk) and then retry the allocation procedure on the free
>> list. Note that since some callers of unit_alloc() do not allow to block
>> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
>> manner (atomic = true in alloc_bulk).
>>
>> Signed-off-by: Changwoo Min <changwoo@igalia.com>
>> ---
>>   kernel/bpf/memalloc.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 889374722d0a..22fe9cfb2b56 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -784,6 +784,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>          struct llist_node *llnode = NULL;
>>          unsigned long flags;
>>          int cnt = 0;
>> +       bool retry = false;
> 
> "retry = false;" reads weird to me. Maybe rename it as "retried"?

"retried" reads betters. Will fix it.

> 
>>
>>          /* Disable irqs to prevent the following race for majority of prog types:
>>           * prog_A
>> @@ -795,6 +796,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>           * Use per-cpu 'active' counter to order free_list access between
>>           * unit_alloc/unit_free/bpf_mem_refill.
>>           */
>> +retry_alloc:
>>          local_irq_save(flags);
>>          if (local_inc_return(&c->active) == 1) {
>>                  llnode = __llist_del_first(&c->free_llist);
>> @@ -815,6 +817,13 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>           */
>>          local_irq_restore(flags);
>>
>> +       if (unlikely(!llnode && !retry)) {
>> +               int cpu = smp_processor_id();
>> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);
> cpu_to_node() is not necessary, we can just do
> 
> alloc_bulk(c, 1, NUMA_NO_NODE, true);

Sure, will change it as suggested.

> 
> Also, maybe we can let alloc_bulk return int (0 or -ENOMEM).
> For -ENOMEM, there is no need to goto retry_alloc.
> 
> Does this make sense?

Yup, I will change the alloc_bulk() as it returns 0 or -ENOMEM.

Regards,
Changwoo Min
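
Folding the suggestions above together (the "retried" rename, NUMA_NO_NODE,
and an alloc_bulk() changed to return 0 or -ENOMEM), the slow path of a
hypothetical follow-up revision might look roughly like the sketch below.
This is only the direction agreed on in this exchange, not code that was
posted to the thread:

	/* ... fast path unchanged ... */
	local_irq_restore(flags);

	if (unlikely(!llnode && !retried)) {
		/* Retry only if the synchronous one-entry refill succeeded;
		 * on -ENOMEM there is nothing new to pop, so give up.
		 */
		if (!alloc_bulk(c, 1, NUMA_NO_NODE, true)) {
			retried = true;
			goto retry_alloc;
		}
	}

	return llnode;
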
Kumar Kartikeya Dwivedi Feb. 13, 2025, 9:05 a.m. UTC | #3
On Thu, 13 Feb 2025 at 09:42, Changwoo Min <changwoo@igalia.com> wrote:
>
> Hello Song,
>
> Thank you for the review!
>
> On 25. 2. 13. 03:33, Song Liu wrote:
> > On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
> >>
> >> When there is no entry in the free list (c->free_llist), unit_alloc()
> >> fails even when there is available memory in the system, causing allocation
> >> failure in various BPF calls -- such as bpf_mem_alloc() and
> >> bpf_cpumask_create().
> >>
> >> Such allocation failure can happen, especially when a BPF program tries many
> >> allocations -- more than a delta between high and low watermarks -- in an
> >> IRQ-disabled context.
> >
> > Can we add a selftests for this scenario?
>
> It would be a bit tricky to create an IRQ-disabled case in a BPF
> program. However, I think it will be possible to reproduce the
> allocation failure issue when allocating sufficiently enough
> small allocations.

You can also make use of the recently introduced
bpf_local_irq_{save,restore} in the selftest.

>
> [...]
>
Changwoo Min Feb. 13, 2025, 10:11 a.m. UTC | #4
Hello Kumar,

On 25. 2. 13. 18:05, Kumar Kartikeya Dwivedi wrote:
> On Thu, 13 Feb 2025 at 09:42, Changwoo Min <changwoo@igalia.com> wrote:
>>
>> Hello Song,
>>
>> Thank you for the review!
>>
>> On 25. 2. 13. 03:33, Song Liu wrote:
>>> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>>>>
>>>> When there is no entry in the free list (c->free_llist), unit_alloc()
>>>> fails even when there is available memory in the system, causing allocation
>>>> failure in various BPF calls -- such as bpf_mem_alloc() and
>>>> bpf_cpumask_create().
>>>>
>>>> Such allocation failure can happen, especially when a BPF program tries many
>>>> allocations -- more than a delta between high and low watermarks -- in an
>>>> IRQ-disabled context.
>>>
>>> Can we add a selftests for this scenario?
>>
>> It would be a bit tricky to create an IRQ-disabled case in a BPF
>> program. However, I think it will be possible to reproduce the
>> allocation failure issue when allocating sufficiently enough
>> small allocations.
> 
> You can also make use of recently introduced
> bpf_local_irq_{save,restore} in the selftest.

Thank you for the suggestion. I will try it out.

Regards,
Changwoo Min
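
For illustration, a selftest along these lines might pin several bpf_cpumask
objects in map kptr fields while IRQs are disabled via the new kfuncs, so the
refill irq_work cannot run between allocations. The sketch below is rough and
untested; the program, map, and variable names are invented, and whether the
verifier permits these particular kfuncs and helpers inside an IRQ-disabled
region is an assumption that would need to be confirmed:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	/* kfunc declarations; in-tree selftests pull these from shared headers */
	struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
	void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
	void bpf_local_irq_save(unsigned long *flags) __ksym;
	void bpf_local_irq_restore(unsigned long *flags) __ksym;

	struct cpumask_box {
		struct bpf_cpumask __kptr *mask;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 16);
		__type(key, u32);
		__type(value, struct cpumask_box);
	} boxes SEC(".maps");

	int create_failed;

	SEC("tp_btf/task_newtask")
	int BPF_PROG(exhaust_free_llist, struct task_struct *task, u64 clone_flags)
	{
		struct bpf_cpumask *mask, *old;
		struct cpumask_box *box;
		unsigned long flags;
		u32 i;

		bpf_local_irq_save(&flags);
		/* More allocations than the prefill/watermark delta, with IRQs
		 * off so the deferred refill cannot top the list up in between.
		 */
		for (i = 0; i < 16; i++) {
			box = bpf_map_lookup_elem(&boxes, &i);
			if (!box)
				break;
			mask = bpf_cpumask_create();
			if (!mask) {
				create_failed = 1;	/* the failure this patch targets */
				break;
			}
			old = bpf_kptr_xchg(&box->mask, mask);
			if (old)
				bpf_cpumask_release(old);
		}
		bpf_local_irq_restore(&flags);
		return 0;
	}
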
Alexei Starovoitov Feb. 13, 2025, 5:45 p.m. UTC | #5
On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
> manner (atomic = true in alloc_bulk).

...
> +       if (unlikely(!llnode && !retry)) {
> +               int cpu = smp_processor_id();
> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);

This is broken.
Passing atomic doesn't help.
unit_alloc() can be called from any context
including NMI/IRQ/kprobe deeply nested in slab internals.
kmalloc() is not safe from there.
The whole point of bpf_mem_alloc() is to be safe from
unknown context. If we could do kmalloc(GFP_NOWAIT)
everywhere, bpf_mem_alloc() wouldn't be needed.

But we may do something.
Draining free_by_rcu_ttrace and waiting_for_gp_ttrace can be done,
but will it address your case?
The commit log is too terse to understand what exactly is going on.
Pls share the call stack. What is the allocation size?
How many do you do in a sequence?
Why irq-s are disabled? Isn't this for scx ?

pw-bot: cr
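
The "passing atomic doesn't help" point can be seen in the refill helper
itself: even with atomic = true, alloc_bulk() still ends up in the slab
allocator, just with GFP_NOWAIT instead of GFP_KERNEL. Roughly (simplified;
the reuse of free_by_rcu_ttrace objects and the accounting flags are elided):

	static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
	{
		gfp_t gfp = atomic ? GFP_NOWAIT | __GFP_NOWARN
				   : GFP_KERNEL | __GFP_NOWARN;
		void *obj;
		int i;

		for (i = 0; i < cnt; i++) {
			/* Still a slab allocation, which is not safe from NMI or
			 * from tracing hooks nested inside slab internals.
			 */
			obj = __alloc(c, node, gfp);	/* kmalloc_node() under the hood */
			if (!obj)
				break;
			/* ... push obj onto c->free_llist ... */
		}
	}
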
Changwoo Min Feb. 14, 2025, 9:23 a.m. UTC | #6
Hello Alexei,

Thank you for the comments! I reordered your comments for ease of 
explanation.

On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:

> The commit log is too terse to understand what exactly is going on.
> Pls share the call stack. What is the allocation size?
> How many do you do in a sequence?

The symptom is that an scx scheduler (scx_lavd) fails to load on
an ARM64 platform on its first try. The second try succeeds. In
the failure case, the kernel spits the following messages:

[   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
[   27.431396] sched_ext: lavd: ops.init() failed (-12)
[   27.431401]    scx_ops_enable.isra.0+0x838/0xe48
[   27.431413]    bpf_scx_reg+0x18/0x30
[   27.431418]    bpf_struct_ops_link_create+0x144/0x1a0
[   27.431427]    __sys_bpf+0x1560/0x1f98
[   27.431433]    __arm64_sys_bpf+0x2c/0x80
[   27.431439]    do_el0_svc+0x74/0x120
[   27.431446]    el0_svc+0x80/0xb0
[   27.431454]    el0t_64_sync_handler+0x120/0x138
[   27.431460]    el0t_64_sync+0x174/0x178

The ops.init() failed because the 5th bpf_cpumask_create() call
failed during the initialization of the BPF scheduler. The exact
point where bpf_cpumask_create() failed is here [1]. That scx
scheduler allocates 5 CPU masks to aid its scheduling decisions.

Also, it seems that there is no graceful way to handle the
allocation failure since it happens during the initialization of
the scx scheduler.

In my digging of the code, bpf_cpumask_create() relies on
bpf_mem_cache_alloc(), and bpf_mem_alloc_init() prefills only
4 entries per CPU (prefill_mem_cache), so the 5th allocation of
the cpumask failed.

Increasing the prefill entries would be a solution, but that
would cause unnecessary memory overhead in other cases, so
I avoided that approach.

> But we may do something.
> Draining free_by_rcu_ttrace and waiting_for_gp_ttrace can be done,
> but will it address your case?

Unfortunately, harvesting free_by_rcu_ttrace and
waiting_for_gp_ttrace does not help (I tested it). In my case,
the memory allocation fails when loading an scx scheduler, so
free_by_rcu_ttrace and waiting_for_gp_ttrace are empty, and there
is nothing to harvest.


> Why irq-s are disabled? Isn't this for scx ?

In this particular scenario, IRQs are not disabled. I just
meant that such an allocation failure can happen easily with
excessive allocations when IRQs are disabled.

>> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
>> manner (atomic = true in alloc_bulk).
> 
> ...
>> +       if (unlikely(!llnode && !retry)) {
>> +               int cpu = smp_processor_id();
>> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);
> 
> This is broken.
> Passing atomic doesn't help.
> unit_alloc() can be called from any context
> including NMI/IRQ/kprobe deeply nested in slab internals.
> kmalloc() is not safe from there.
> The whole point of bpf_mem_alloc() is to be safe from
> unknown context. If we could do kmalloc(GFP_NOWAIT)
> everywhere bpf_mem_alloc() would be needed.

I didn't think about the NMI case, where GFP_NOWAIT and GFP_ATOMIC are 
not safe.

Hmm.. maybe, we can extend 'bpf_mem_alloc_init()' or 'struct
bpf_mem_alloc' to specify the (initial) prefill count. This way
we can set a bit larger prefill count (say 8) for bpf cpumask.
What do you think?

[1] 
https://github.com/sched-ext/scx/blob/f17985cac0a60ba0136bbafa3f546db2b966cec0/scheds/rust/scx_lavd/src/bpf/main.bpf.c#L1970

Regards,
Changwoo Min
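
The prefill-count idea above is only described in words; one hypothetical
shape for it is sketched below. The _prefill variant, the per-CPU count of 8,
and the cpumask_ma instance name are invented here for illustration and are
not an existing kernel API; only bpf_mem_alloc_init(ma, size, percpu) exists
today:

	/* Hypothetical extension, for illustration only: let a bpf_mem_alloc
	 * user ask for a larger initial prefill when its first allocations can
	 * happen before the irq_work refill has had a chance to run.
	 */
	int bpf_mem_alloc_init_prefill(struct bpf_mem_alloc *ma, int size, bool percpu,
				       int prefill_cnt);

	/* cpumask.c could then request, say, 8 prefilled entries per CPU
	 * instead of the default 4 used for small objects:
	 */
	err = bpf_mem_alloc_init_prefill(&cpumask_ma, sizeof(struct bpf_cpumask),
					 false, 8);
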
Alexei Starovoitov Feb. 15, 2025, 3:51 a.m. UTC | #7
On Fri, Feb 14, 2025 at 1:24 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> Hello Alexei,
>
> Thank you for the comments! I reordered your comments for ease of
> explanation.
>
> On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
> > On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> > The commit log is too terse to understand what exactly is going on.
> > Pls share the call stack. What is the allocation size?
> > How many do you do in a sequence?
>
> The symptom is that an scx scheduler (scx_lavd) fails to load on
> an ARM64 platform on its first try. The second try succeeds. In
> the failure case, the kernel spits the following messages:
>
> [   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
> [   27.431396] sched_ext: lavd: ops.init() failed (-12)
> [   27.431401]    scx_ops_enable.isra.0+0x838/0xe48
> [   27.431413]    bpf_scx_reg+0x18/0x30
> [   27.431418]    bpf_struct_ops_link_create+0x144/0x1a0
> [   27.431427]    __sys_bpf+0x1560/0x1f98
> [   27.431433]    __arm64_sys_bpf+0x2c/0x80
> [   27.431439]    do_el0_svc+0x74/0x120
> [   27.431446]    el0_svc+0x80/0xb0
> [   27.431454]    el0t_64_sync_handler+0x120/0x138
> [   27.431460]    el0t_64_sync+0x174/0x178
>
> The ops.init() failed because the 5th bpf_cpumask_create() calls
> failed during the initialization of the BPF scheduler. The exact
> point where bpf_cpumask_create() failed is here [1]. That scx
> scheduler allocates 5 CPU masks to aid its scheduling decision.

...

> In this particular scenario, the IRQ is not disabled. I just

since irq-s are not disabled the unit_alloc() should have done:
        if (cnt < c->low_watermark)
                irq_work_raise(c);

and alloc_bulk() should have started executing after the first
calloc_cpumask(&active_cpumask);
to refill it from 3 to 64.

What is sizeof(struct bpf_cpumask) in your system?

Something doesn't add up. irq_work_queue() should be
instant when irq-s are not disabled.
This is not IRQ_WORK_LAZY.
Are you running PREEMPT_RT ?
That would be the only explanation.
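
For reference, the deferred refill being described here runs from an irq_work
raised by unit_alloc(); schematically (simplified from kernel/bpf/memalloc.c,
with the RCU housekeeping elided):

	static void irq_work_raise(struct bpf_mem_cache *c)
	{
		irq_work_queue(&c->refill_work);
	}

	static void bpf_mem_refill(struct irq_work *work)
	{
		struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
		int cnt;

		/* A racy read of free_cnt is fine; it only steers the refill. */
		cnt = c->free_cnt;
		if (cnt < c->low_watermark)
			/* Runs on the CPU that raised the irq_work; with IRQs
			 * enabled this normally happens almost immediately and
			 * refills the list back up toward the high watermark.
			 */
			alloc_bulk(c, c->batch, NUMA_NO_NODE, true);
		else if (cnt > c->high_watermark)
			free_bulk(c);
	}
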
Changwoo Min Feb. 15, 2025, 3:16 p.m. UTC | #8
Hello,

On 25. 2. 15. 12:51, Alexei Starovoitov wrote:
 > On Fri, Feb 14, 2025 at 1:24 AM Changwoo Min <changwoo@igalia.com> wrote:
 >>
 >> Hello Alexei,
 >>
 >> Thank you for the comments! I reordered your comments for ease of
 >> explanation.
 >>
 >> On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
 >>> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> 
wrote:
 >>
 >>> The commit log is too terse to understand what exactly is going on.
 >>> Pls share the call stack. What is the allocation size?
 >>> How many do you do in a sequence?
 >>
 >> The symptom is that an scx scheduler (scx_lavd) fails to load on
 >> an ARM64 platform on its first try. The second try succeeds. In
 >> the failure case, the kernel spits the following messages:
 >>
 >> [   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
 >> [   27.431396] sched_ext: lavd: ops.init() failed (-12)
 >> [   27.431401]    scx_ops_enable.isra.0+0x838/0xe48
 >> [   27.431413]    bpf_scx_reg+0x18/0x30
 >> [   27.431418]    bpf_struct_ops_link_create+0x144/0x1a0
 >> [   27.431427]    __sys_bpf+0x1560/0x1f98
 >> [   27.431433]    __arm64_sys_bpf+0x2c/0x80
 >> [   27.431439]    do_el0_svc+0x74/0x120
 >> [   27.431446]    el0_svc+0x80/0xb0
 >> [   27.431454]    el0t_64_sync_handler+0x120/0x138
 >> [   27.431460]    el0t_64_sync+0x174/0x178
 >>
 >> The ops.init() failed because the 5th bpf_cpumask_create() calls
 >> failed during the initialization of the BPF scheduler. The exact
 >> point where bpf_cpumask_create() failed is here [1]. That scx
 >> scheduler allocates 5 CPU masks to aid its scheduling decision.
 >
 > ...
 >
 >> In this particular scenario, the IRQ is not disabled. I just
 >
 > since irq-s are not disabled the unit_alloc() should have done:
 >          if (cnt < c->low_watermark)
 >                  irq_work_raise(c);
 >
 > and alloc_bulk() should have started executing after the first
 > calloc_cpumask(&active_cpumask);
 > to refill it from 3 to 64

Is there any possibility that irq_work is not scheduled right away on 
aarch64?

 >
 > What is sizeof(struct bpf_cpumask) in your system?

In my system, sizeof(struct bpf_cpumask) is 1032.

 >
 > Something doesn't add up. irq_work_queue() should be
 > instant when irq-s are not disabled.
 >  > This is not IRQ_WORK_LAZY.
 >  > Are you running PREEMPT_RT ?

No, CONFIG_PREEMPT_RT is not set.

Regards,
Changwoo Min
Changwoo Min Feb. 16, 2025, 4:04 p.m. UTC | #9
Hello,

 >  > What is sizeof(struct bpf_cpumask) in your system?
 >
 > In my system, sizeof(struct bpf_cpumask) is 1032.
It was a wrong number. sizeof(struct bpf_cpumask) is actually 16.

On 25. 2. 16. 00:16, Changwoo Min wrote:
> Hello,
> 
> On 25. 2. 15. 12:51, Alexei Starovoitov wrote:
>  > On Fri, Feb 14, 2025 at 1:24 AM Changwoo Min <changwoo@igalia.com> 
> wrote:
>  >>
>  >> Hello Alexei,
>  >>
>  >> Thank you for the comments! I reordered your comments for ease of
>  >> explanation.
>  >>
>  >> On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
>  >>> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> 
> wrote:
>  >>
>  >>> The commit log is too terse to understand what exactly is going on.
>  >>> Pls share the call stack. What is the allocation size?
>  >>> How many do you do in a sequence?
>  >>
>  >> The symptom is that an scx scheduler (scx_lavd) fails to load on
>  >> an ARM64 platform on its first try. The second try succeeds. In
>  >> the failure case, the kernel spits the following messages:
>  >>
>  >> [   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
>  >> [   27.431396] sched_ext: lavd: ops.init() failed (-12)
>  >> [   27.431401]    scx_ops_enable.isra.0+0x838/0xe48
>  >> [   27.431413]    bpf_scx_reg+0x18/0x30
>  >> [   27.431418]    bpf_struct_ops_link_create+0x144/0x1a0
>  >> [   27.431427]    __sys_bpf+0x1560/0x1f98
>  >> [   27.431433]    __arm64_sys_bpf+0x2c/0x80
>  >> [   27.431439]    do_el0_svc+0x74/0x120
>  >> [   27.431446]    el0_svc+0x80/0xb0
>  >> [   27.431454]    el0t_64_sync_handler+0x120/0x138
>  >> [   27.431460]    el0t_64_sync+0x174/0x178
>  >>
>  >> The ops.init() failed because the 5th bpf_cpumask_create() calls
>  >> failed during the initialization of the BPF scheduler. The exact
>  >> point where bpf_cpumask_create() failed is here [1]. That scx
>  >> scheduler allocates 5 CPU masks to aid its scheduling decision.
>  >
>  > ...
>  >
>  >> In this particular scenario, the IRQ is not disabled. I just
>  >
>  > since irq-s are not disabled the unit_alloc() should have done:
>  >          if (cnt < c->low_watermark)
>  >                  irq_work_raise(c);
>  >
>  > and alloc_bulk() should have started executing after the first
>  > calloc_cpumask(&active_cpumask);
>  > to refill it from 3 to 64
> 
> Is there any possibility that irq_work is not scheduled right away on 
> aarch64?
> 
>  >
>  > What is sizeof(struct bpf_cpumask) in your system?
> 
> In my system, sizeof(struct bpf_cpumask) is 1032.
> 
>  >
>  > Something doesn't add up. irq_work_queue() should be
>  > instant when irq-s are not disabled.
>  > This is not IRQ_WORK_LAZY.> Are you running PREEMPT_RT ?
> 
> No, CONFIG_PREEMPT_RT is not set.
> 
> Regards,
> Changwoo Min
> 
>
Hou Tao Feb. 17, 2025, 2:19 a.m. UTC | #10
Hi,

On 2/17/2025 12:04 AM, Changwoo Min wrote:
> Hello,
>
> >  > What is sizeof(struct bpf_cpumask) in your system?
> >
> > In my system, sizeof(struct bpf_cpumask) is 1032.
> It was a wrong number. sizeof(struct bpf_cpumask) is actually 16.
>
> On 25. 2. 16. 00:16, Changwoo Min wrote:
>> Hello,
>>
>> On 25. 2. 15. 12:51, Alexei Starovoitov wrote:
>>  > On Fri, Feb 14, 2025 at 1:24 AM Changwoo Min <changwoo@igalia.com>
>> wrote:
>>  >>
>>  >> Hello Alexei,
>>  >>
>>  >> Thank you for the comments! I reordered your comments for ease of
>>  >> explanation.
>>  >>
>>  >> On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
>>  >>> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min
>> <changwoo@igalia.com> wrote:
>>  >>
>>  >>> The commit log is too terse to understand what exactly is going on.
>>  >>> Pls share the call stack. What is the allocation size?
>>  >>> How many do you do in a sequence?
>>  >>
>>  >> The symptom is that an scx scheduler (scx_lavd) fails to load on
>>  >> an ARM64 platform on its first try. The second try succeeds. In
>>  >> the failure case, the kernel spits the following messages:
>>  >>
>>  >> [   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime
>> error)
>>  >> [   27.431396] sched_ext: lavd: ops.init() failed (-12)
>>  >> [   27.431401]    scx_ops_enable.isra.0+0x838/0xe48
>>  >> [   27.431413]    bpf_scx_reg+0x18/0x30
>>  >> [   27.431418]    bpf_struct_ops_link_create+0x144/0x1a0
>>  >> [   27.431427]    __sys_bpf+0x1560/0x1f98
>>  >> [   27.431433]    __arm64_sys_bpf+0x2c/0x80
>>  >> [   27.431439]    do_el0_svc+0x74/0x120
>>  >> [   27.431446]    el0_svc+0x80/0xb0
>>  >> [   27.431454]    el0t_64_sync_handler+0x120/0x138
>>  >> [   27.431460]    el0t_64_sync+0x174/0x178
>>  >>
>>  >> The ops.init() failed because the 5th bpf_cpumask_create() calls
>>  >> failed during the initialization of the BPF scheduler. The exact
>>  >> point where bpf_cpumask_create() failed is here [1]. That scx
>>  >> scheduler allocates 5 CPU masks to aid its scheduling decision.
>>  >
>>  > ...
>>  >
>>  >> In this particular scenario, the IRQ is not disabled. I just
>>  >
>>  > since irq-s are not disabled the unit_alloc() should have done:
>>  >          if (cnt < c->low_watermark)
>>  >                  irq_work_raise(c);
>>  >
>>  > and alloc_bulk() should have started executing after the first
>>  > calloc_cpumask(&active_cpumask);
>>  > to refill it from 3 to 64
>>
>> Is there any possibility that irq_work is not scheduled right away on
>> aarch64?

It is an IPI. I think its priority is higher than that of the current process.
>>
>>  >
>>  > What is sizeof(struct bpf_cpumask) in your system?
>>
>> In my system, sizeof(struct bpf_cpumask) is 1032.
>It was a wrong number. sizeof(struct bpf_cpumask) is actually 16.

It is indeed strange. My earlier guess was that bpf_cpumask might be
greater than 4K, so the refill in the irq work could fail due to memory
fragmentation, but the allocation size is tiny.
>>
>>  >
>>  > Something doesn't add up. irq_work_queue() should be
>>  > instant when irq-s are not disabled.
>>  > This is not IRQ_WORK_LAZY.> Are you running PREEMPT_RT ?
>>
>> No, CONFIG_PREEMPT_RT is not set.

Could you please share the kernel .config file and the kernel version
where the problem occurs? And if you are running the test in QEMU,
please also share the command line used to run QEMU.
>>
>> Regards,
>> Changwoo Min
>>
>>
>
>
> .

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 889374722d0a..22fe9cfb2b56 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -784,6 +784,7 @@  static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	struct llist_node *llnode = NULL;
 	unsigned long flags;
 	int cnt = 0;
+	bool retry = false;
 
 	/* Disable irqs to prevent the following race for majority of prog types:
 	 * prog_A
@@ -795,6 +796,7 @@  static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	 * Use per-cpu 'active' counter to order free_list access between
 	 * unit_alloc/unit_free/bpf_mem_refill.
 	 */
+retry_alloc:
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		llnode = __llist_del_first(&c->free_llist);
@@ -815,6 +817,13 @@  static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	 */
 	local_irq_restore(flags);
 
+	if (unlikely(!llnode && !retry)) {
+		int cpu = smp_processor_id();
+		alloc_bulk(c, 1, cpu_to_node(cpu), true);
+		retry = true;
+		goto retry_alloc;
+	}
+
 	return llnode;
 }