Message ID: 20250212084851.150169-1-changwoo@igalia.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: [bpf-next] bpf: Add a retry after refilling the free list when unit_alloc() fails
On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> When there is no entry in the free list (c->free_llist), unit_alloc()
> fails even when there is available memory in the system, causing allocation
> failure in various BPF calls -- such as bpf_mem_alloc() and
> bpf_cpumask_create().
>
> Such allocation failure can happen, especially when a BPF program tries many
> allocations -- more than a delta between high and low watermarks -- in an
> IRQ-disabled context.

Can we add a selftest for this scenario?

> To address the problem, when there is no free entry, refill one entry on the
> free list (alloc_bulk) and then retry the allocation procedure on the free
> list. Note that since some callers of unit_alloc() do not allow to block
> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
> manner (atomic = true in alloc_bulk).
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/bpf/memalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 889374722d0a..22fe9cfb2b56 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -784,6 +784,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>         struct llist_node *llnode = NULL;
>         unsigned long flags;
>         int cnt = 0;
> +       bool retry = false;

"retry = false;" reads weird to me. Maybe rename it as "retried"?

>
>         /* Disable irqs to prevent the following race for majority of prog types:
>          * prog_A
> @@ -795,6 +796,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>          * Use per-cpu 'active' counter to order free_list access between
>          * unit_alloc/unit_free/bpf_mem_refill.
>          */
> +retry_alloc:
>         local_irq_save(flags);
>         if (local_inc_return(&c->active) == 1) {
>                 llnode = __llist_del_first(&c->free_llist);
> @@ -815,6 +817,13 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>          */
>         local_irq_restore(flags);
>
> +       if (unlikely(!llnode && !retry)) {
> +               int cpu = smp_processor_id();
> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);

cpu_to_node() is not necessary, we can just do

        alloc_bulk(c, 1, NUMA_NO_NODE, true);

Also, maybe we can let alloc_bulk return int (0 or -ENOMEM).
For -ENOMEM, there is no need to goto retry_alloc.

Does this make sense?

Thanks,
Song

> +               retry = true;
> +               goto retry_alloc;
> +       }
> +
>         return llnode;
> }
>
> --
> 2.48.1
>
Hello Song,

Thank you for the review!

On 25. 2. 13. 03:33, Song Liu wrote:
> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>>
>> When there is no entry in the free list (c->free_llist), unit_alloc()
>> fails even when there is available memory in the system, causing allocation
>> failure in various BPF calls -- such as bpf_mem_alloc() and
>> bpf_cpumask_create().
>>
>> Such allocation failure can happen, especially when a BPF program tries many
>> allocations -- more than a delta between high and low watermarks -- in an
>> IRQ-disabled context.
>
> Can we add a selftest for this scenario?

It would be a bit tricky to create an IRQ-disabled case in a BPF
program. However, I think it will be possible to reproduce the
allocation failure issue with a sufficiently large number of small
allocations.

>> [...]
>> @@ -784,6 +784,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>         struct llist_node *llnode = NULL;
>>         unsigned long flags;
>>         int cnt = 0;
>> +       bool retry = false;
>
> "retry = false;" reads weird to me. Maybe rename it as "retried"?

"retried" reads better. Will fix it.

>> [...]
>> +       if (unlikely(!llnode && !retry)) {
>> +               int cpu = smp_processor_id();
>> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);
>
> cpu_to_node() is not necessary, we can just do
>
>         alloc_bulk(c, 1, NUMA_NO_NODE, true);

Sure, will change it as suggested.

> Also, maybe we can let alloc_bulk return int (0 or -ENOMEM).
> For -ENOMEM, there is no need to goto retry_alloc.
>
> Does this make sense?

Yup, I will change alloc_bulk() so that it returns 0 or -ENOMEM.

Regards,
Changwoo Min
On Thu, 13 Feb 2025 at 09:42, Changwoo Min <changwoo@igalia.com> wrote:
>
> Hello Song,
>
> Thank you for the review!
>
> On 25. 2. 13. 03:33, Song Liu wrote:
> > [...]
> > Can we add a selftests for this scenario?
>
> It would be a bit tricky to create an IRQ-disabled case in a BPF
> program. However, I think it will be possible to reproduce the
> allocation failure issue when allocating sufficiently enough
> small allocations.

You can also make use of the recently introduced
bpf_local_irq_{save,restore} in the selftest.

> [...]
Hello Kumar,

On 25. 2. 13. 18:05, Kumar Kartikeya Dwivedi wrote:
> [...]
> You can also make use of the recently introduced
> bpf_local_irq_{save,restore} in the selftest.

Thank you for the suggestion. I will try it out.

Regards,
Changwoo Min
On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
> manner (atomic = true in alloc_bulk).

...

> +       if (unlikely(!llnode && !retry)) {
> +               int cpu = smp_processor_id();
> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);

This is broken. Passing atomic doesn't help. unit_alloc() can be
called from any context, including NMI/IRQ/kprobe deeply nested in
slab internals. kmalloc() is not safe from there. The whole point
of bpf_mem_alloc() is to be safe from an unknown context. If we
could do kmalloc(GFP_NOWAIT) everywhere, bpf_mem_alloc() would not
be needed.

But we may do something. Draining free_by_rcu_ttrace and
waiting_for_gp_ttrace can be done, but will it address your case?

The commit log is too terse to understand what exactly is going on.
Pls share the call stack. What is the allocation size? How many do
you do in a sequence? Why are irq-s disabled? Isn't this for scx?

pw-bot: cr
Hello Alexei,

Thank you for the comments! I reordered your comments for ease of
explanation.

On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> The commit log is too terse to understand what exactly is going on.
> Pls share the call stack. What is the allocation size?
> How many do you do in a sequence?

The symptom is that an scx scheduler (scx_lavd) fails to load on
an ARM64 platform on its first try. The second try succeeds. In
the failure case, the kernel spits out the following messages:

[   27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
[   27.431396] sched_ext: lavd: ops.init() failed (-12)
[   27.431401] scx_ops_enable.isra.0+0x838/0xe48
[   27.431413] bpf_scx_reg+0x18/0x30
[   27.431418] bpf_struct_ops_link_create+0x144/0x1a0
[   27.431427] __sys_bpf+0x1560/0x1f98
[   27.431433] __arm64_sys_bpf+0x2c/0x80
[   27.431439] do_el0_svc+0x74/0x120
[   27.431446] el0_svc+0x80/0xb0
[   27.431454] el0t_64_sync_handler+0x120/0x138
[   27.431460] el0t_64_sync+0x174/0x178

ops.init() failed because the 5th bpf_cpumask_create() call failed
during the initialization of the BPF scheduler. The exact point
where bpf_cpumask_create() failed is here [1]. That scx scheduler
allocates 5 CPU masks to aid its scheduling decisions. Also, it
seems that there is no graceful way to handle the allocation
failure since it happens during the initialization of the scx
scheduler.

In my digging into the code, bpf_cpumask_create() relies on
bpf_mem_cache_alloc(), and bpf_mem_alloc_init() prefills only 4
entries per CPU (prefill_mem_cache), so the 5th allocation of the
cpumask failed. Increasing the prefill entries would be a
solution, but that would cause unnecessary memory overhead in
other cases, so I avoided that approach.

> But we may do something.
> Draining free_by_rcu_ttrace and waiting_for_gp_ttrace can be done,
> but will it address your case?

Unfortunately, harvesting free_by_rcu_ttrace and
waiting_for_gp_ttrace does not help (I tested it). In my case, the
memory allocation fails when loading an scx scheduler, so
free_by_rcu_ttrace and waiting_for_gp_ttrace are empty, and there
is nothing to harvest.

> Why irq-s are disabled? Isn't this for scx ?

In this particular scenario, IRQs are not disabled. I just meant
that such an allocation failure can happen easily with excessive
allocation when IRQs are disabled.

>> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
>> manner (atomic = true in alloc_bulk).
>
> ...
>
>> +       if (unlikely(!llnode && !retry)) {
>> +               int cpu = smp_processor_id();
>> +               alloc_bulk(c, 1, cpu_to_node(cpu), true);
>
> This is broken.
> Passing atomic doesn't help.
> unit_alloc() can be called from any context
> including NMI/IRQ/kprobe deeply nested in slab internals.
> kmalloc() is not safe from there.

I didn't think about the NMI case, where GFP_NOWAIT and GFP_ATOMIC
are not safe. Hmm.. maybe we can extend bpf_mem_alloc_init() or
struct bpf_mem_alloc to specify the (initial) prefill count. This
way we can set a somewhat larger prefill count (say, 8) for bpf
cpumask. What do you think?

[1] https://github.com/sched-ext/scx/blob/f17985cac0a60ba0136bbafa3f546db2b966cec0/scheds/rust/scx_lavd/src/bpf/main.bpf.c#L1970

Regards,
Changwoo Min
On Fri, Feb 14, 2025 at 1:24 AM Changwoo Min <changwoo@igalia.com> wrote:
> [...]
> The symptom is that an scx scheduler (scx_lavd) fails to load on
> an ARM64 platform on its first try. The second try succeeds.
> [...]
> The ops.init() failed because the 5th bpf_cpumask_create() calls
> failed during the initialization of the BPF scheduler. The exact
> point where bpf_cpumask_create() failed is here [1]. That scx
> scheduler allocates 5 CPU masks to aid its scheduling decision.

...

> In this particular scenario, the IRQ is not disabled.

Since irq-s are not disabled, unit_alloc() should have done:

        if (cnt < c->low_watermark)
                irq_work_raise(c);

and alloc_bulk() should have started executing after the first
calloc_cpumask(&active_cpumask) to refill the free list from 3
to 64.

What is sizeof(struct bpf_cpumask) on your system?

Something doesn't add up. irq_work_queue() should be instant when
irq-s are not disabled. This is not IRQ_WORK_LAZY.
Are you running PREEMPT_RT? That would be the only explanation.
Hello,

On 25. 2. 15. 12:51, Alexei Starovoitov wrote:
> [...]
> Since irq-s are not disabled, unit_alloc() should have done:
>
>         if (cnt < c->low_watermark)
>                 irq_work_raise(c);
>
> and alloc_bulk() should have started executing after the first
> calloc_cpumask(&active_cpumask) to refill the free list from 3
> to 64.

Is there any possibility that irq_work is not scheduled right away
on aarch64?

> What is sizeof(struct bpf_cpumask) on your system?

In my system, sizeof(struct bpf_cpumask) is 1032.

> Something doesn't add up. irq_work_queue() should be instant when
> irq-s are not disabled. This is not IRQ_WORK_LAZY.
> Are you running PREEMPT_RT?

No, CONFIG_PREEMPT_RT is not set.

Regards,
Changwoo Min
Hello,

>> What is sizeof(struct bpf_cpumask) on your system?
>
> In my system, sizeof(struct bpf_cpumask) is 1032.

It was a wrong number. sizeof(struct bpf_cpumask) is actually 16.

On 25. 2. 16. 00:16, Changwoo Min wrote:
> [...]
Hi,

On 2/17/2025 12:04 AM, Changwoo Min wrote:
> [...]
>> Is there any possibility that irq_work is not scheduled right away
>> on aarch64?

It is an IPI. I think its priority is higher than the current process.

>>> What is sizeof(struct bpf_cpumask) on your system?
>>
>> In my system, sizeof(struct bpf_cpumask) is 1032.
>
> It was a wrong number. sizeof(struct bpf_cpumask) is actually 16.

It is indeed strange. The earlier guess was that bpf_cpumask might
be greater than 4K, so the refill in the irq work could fail due to
memory fragmentation, but the allocation size is tiny.

>>> Something doesn't add up. irq_work_queue() should be instant when
>>> irq-s are not disabled. This is not IRQ_WORK_LAZY.
>>> Are you running PREEMPT_RT?
>>
>> No, CONFIG_PREEMPT_RT is not set.

Could you please share the kernel .config file and the kernel
version for the problem? And if you are running the test in QEMU,
please also share the command line used to run qemu.
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 889374722d0a..22fe9cfb2b56 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -784,6 +784,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
        struct llist_node *llnode = NULL;
        unsigned long flags;
        int cnt = 0;
+       bool retry = false;

        /* Disable irqs to prevent the following race for majority of prog types:
         * prog_A
@@ -795,6 +796,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
         * Use per-cpu 'active' counter to order free_list access between
         * unit_alloc/unit_free/bpf_mem_refill.
         */
+retry_alloc:
        local_irq_save(flags);
        if (local_inc_return(&c->active) == 1) {
                llnode = __llist_del_first(&c->free_llist);
@@ -815,6 +817,13 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
         */
        local_irq_restore(flags);

+       if (unlikely(!llnode && !retry)) {
+               int cpu = smp_processor_id();
+
+               alloc_bulk(c, 1, cpu_to_node(cpu), true);
+               retry = true;
+               goto retry_alloc;
+       }
+
        return llnode;
 }
When there is no entry in the free list (c->free_llist), unit_alloc()
fails even when there is available memory in the system, causing
allocation failure in various BPF calls -- such as bpf_mem_alloc() and
bpf_cpumask_create().

Such allocation failure can happen, especially when a BPF program tries
many allocations -- more than the delta between the high and low
watermarks -- in an IRQ-disabled context.

To address the problem, when there is no free entry, refill one entry
on the free list (alloc_bulk) and then retry the allocation procedure
on the free list. Note that since some callers of unit_alloc() do not
allow blocking (e.g., bpf_cpumask_create), allocate the additional free
entry in an atomic manner (atomic = true in alloc_bulk).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/bpf/memalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)