diff mbox series

[v2,bpf-next] bpf: Avoid unnecessary -EBUSY from htab_lock_bucket

Message ID 20231004004350.533234-1-song@kernel.org (mailing list archive)
State Superseded
Delegated to: BPF

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-30 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_verifier on x86_64 with llvm-16
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1352 this patch: 1352
netdev/cc_maintainers warning 7 maintainers not CCed: martin.lau@linux.dev jolsa@kernel.org haoluo@google.com sdf@google.com john.fastabend@gmail.com yonghong.song@linux.dev kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 1364 this patch: 1364
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1375 this patch: 1375
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 26 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on s390x with gcc

Commit Message

Song Liu Oct. 4, 2023, 12:43 a.m. UTC
htab_lock_bucket uses the following logic to avoid recursion:

1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
   2.1. if map_locked[hash] is already taken, return -EBUSY;
3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not be able to access the same hash of the hashtab and will get
-EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQs
before checking map_locked:

1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
   3.1. if map_locked[hash] is already taken, return -EBUSY;
4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>

---
Changes in v2:
1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
   (Andrii)
---
 kernel/bpf/hashtab.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Andrii Nakryiko Oct. 4, 2023, 3:08 a.m. UTC | #1
On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>
> htab_lock_bucket uses the following logic to avoid recursion:
>
> 1. preempt_disable();
> 2. check percpu counter htab->map_locked[hash] for recursion;
>    2.1. if map_lock[hash] is already taken, return -BUSY;
> 3. raw_spin_lock_irqsave();
>
> However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
> logic will not able to access the same hash of the hashtab and get -EBUSY.
> This -EBUSY is not really necessary. Fix it by disabling IRQ before
> checking map_locked:
>
> 1. preempt_disable();
> 2. local_irq_save();
> 3. check percpu counter htab->map_locked[hash] for recursion;
>    3.1. if map_lock[hash] is already taken, return -BUSY;
> 4. raw_spin_lock().
>
> Similarly, use raw_spin_unlock() and local_irq_restore() in
> htab_unlock_bucket().
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Song Liu <song@kernel.org>
>
> ---
> Changes in v2:
> 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
>    (Andrii)
> ---
>  kernel/bpf/hashtab.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>

Now it's more symmetrical and seems correct to me, thanks!

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index a8c7e1c5abfa..fd8d4b0addfc 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>         hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>
>         preempt_disable();
> +       local_irq_save(flags);
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
> +               local_irq_restore(flags);
>                 preempt_enable();
>                 return -EBUSY;
>         }
>
> -       raw_spin_lock_irqsave(&b->raw_lock, flags);
> +       raw_spin_lock(&b->raw_lock);
>         *pflags = flags;
>
>         return 0;
> @@ -172,8 +174,9 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>                                       unsigned long flags)
>  {
>         hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> -       raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> +       raw_spin_unlock(&b->raw_lock);
>         __this_cpu_dec(*(htab->map_locked[hash]));
> +       local_irq_restore(flags);
>         preempt_enable();
>  }
>
> --
> 2.34.1
>
Alexei Starovoitov Oct. 4, 2023, 3:33 a.m. UTC | #2
On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
> >
> > htab_lock_bucket uses the following logic to avoid recursion:
> >
> > 1. preempt_disable();
> > 2. check percpu counter htab->map_locked[hash] for recursion;
> >    2.1. if map_lock[hash] is already taken, return -BUSY;
> > 3. raw_spin_lock_irqsave();
> >
> > However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
> > logic will not able to access the same hash of the hashtab and get -EBUSY.
> > This -EBUSY is not really necessary. Fix it by disabling IRQ before
> > checking map_locked:
> >
> > 1. preempt_disable();
> > 2. local_irq_save();
> > 3. check percpu counter htab->map_locked[hash] for recursion;
> >    3.1. if map_lock[hash] is already taken, return -BUSY;
> > 4. raw_spin_lock().
> >
> > Similarly, use raw_spin_unlock() and local_irq_restore() in
> > htab_unlock_bucket().
> >
> > Suggested-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Song Liu <song@kernel.org>
> >
> > ---
> > Changes in v2:
> > 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
> >    (Andrii)
> > ---
> >  kernel/bpf/hashtab.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
>
> Now it's more symmetrical and seems correct to me, thanks!
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index a8c7e1c5abfa..fd8d4b0addfc 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >         hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> >
> >         preempt_disable();
> > +       local_irq_save(flags);
> >         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >                 __this_cpu_dec(*(htab->map_locked[hash]));
> > +               local_irq_restore(flags);
> >                 preempt_enable();
> >                 return -EBUSY;
> >         }
> >
> > -       raw_spin_lock_irqsave(&b->raw_lock, flags);
> > +       raw_spin_lock(&b->raw_lock);

Song,

take a look at s390 crash in BPF CI.
I suspect this patch is causing it.

Ilya,

do you have an idea what is going on?
Song Liu Oct. 4, 2023, 4:18 p.m. UTC | #3
> On Oct 3, 2023, at 8:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> 
>> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>>> 
>>> htab_lock_bucket uses the following logic to avoid recursion:
>>> 
>>> 1. preempt_disable();
>>> 2. check percpu counter htab->map_locked[hash] for recursion;
>>>   2.1. if map_lock[hash] is already taken, return -BUSY;
>>> 3. raw_spin_lock_irqsave();
>>> 
>>> However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
>>> logic will not able to access the same hash of the hashtab and get -EBUSY.
>>> This -EBUSY is not really necessary. Fix it by disabling IRQ before
>>> checking map_locked:
>>> 
>>> 1. preempt_disable();
>>> 2. local_irq_save();
>>> 3. check percpu counter htab->map_locked[hash] for recursion;
>>>   3.1. if map_lock[hash] is already taken, return -BUSY;
>>> 4. raw_spin_lock().
>>> 
>>> Similarly, use raw_spin_unlock() and local_irq_restore() in
>>> htab_unlock_bucket().
>>> 
>>> Suggested-by: Tejun Heo <tj@kernel.org>
>>> Signed-off-by: Song Liu <song@kernel.org>
>>> 
>>> ---
>>> Changes in v2:
>>> 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
>>>   (Andrii)
>>> ---
>>> kernel/bpf/hashtab.c | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>> 
>> 
>> Now it's more symmetrical and seems correct to me, thanks!
>> 
>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> 
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index a8c7e1c5abfa..fd8d4b0addfc 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>        hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>> 
>>>        preempt_disable();
>>> +       local_irq_save(flags);
>>>        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>                __this_cpu_dec(*(htab->map_locked[hash]));
>>> +               local_irq_restore(flags);
>>>                preempt_enable();
>>>                return -EBUSY;
>>>        }
>>> 
>>> -       raw_spin_lock_irqsave(&b->raw_lock, flags);
>>> +       raw_spin_lock(&b->raw_lock);
> 
> Song,
> 
> take a look at s390 crash in BPF CI.
> I suspect this patch is causing it.

It indeed looks like it is triggered by this patch. But I haven't figured
out why it happens. v1 seems OK for the same tests. 

Song

> 
> Ilya,
> 
> do you have an idea what is going on?
Song Liu Oct. 5, 2023, 12:11 a.m. UTC | #4
> On Oct 4, 2023, at 9:18 AM, Song Liu <songliubraving@meta.com> wrote:
> 
> 
> 
>> On Oct 3, 2023, at 8:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>> 
>> On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> 
>>> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>>>> 
>>>> htab_lock_bucket uses the following logic to avoid recursion:
>>>> 
>>>> 1. preempt_disable();
>>>> 2. check percpu counter htab->map_locked[hash] for recursion;
>>>>  2.1. if map_lock[hash] is already taken, return -BUSY;
>>>> 3. raw_spin_lock_irqsave();
>>>> 
>>>> However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
>>>> logic will not able to access the same hash of the hashtab and get -EBUSY.
>>>> This -EBUSY is not really necessary. Fix it by disabling IRQ before
>>>> checking map_locked:
>>>> 
>>>> 1. preempt_disable();
>>>> 2. local_irq_save();
>>>> 3. check percpu counter htab->map_locked[hash] for recursion;
>>>>  3.1. if map_lock[hash] is already taken, return -BUSY;
>>>> 4. raw_spin_lock().
>>>> 
>>>> Similarly, use raw_spin_unlock() and local_irq_restore() in
>>>> htab_unlock_bucket().
>>>> 
>>>> Suggested-by: Tejun Heo <tj@kernel.org>
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> 
>>>> ---
>>>> Changes in v2:
>>>> 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
>>>>  (Andrii)
>>>> ---
>>>> kernel/bpf/hashtab.c | 7 +++++--
>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>> 
>>> 
>>> Now it's more symmetrical and seems correct to me, thanks!
>>> 
>>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>>> 
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index a8c7e1c5abfa..fd8d4b0addfc 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>       hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>> 
>>>>       preempt_disable();
>>>> +       local_irq_save(flags);
>>>>       if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>               __this_cpu_dec(*(htab->map_locked[hash]));
>>>> +               local_irq_restore(flags);
>>>>               preempt_enable();
>>>>               return -EBUSY;
>>>>       }
>>>> 
>>>> -       raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>> +       raw_spin_lock(&b->raw_lock);
>> 
>> Song,
>> 
>> take a look at s390 crash in BPF CI.
>> I suspect this patch is causing it.
> 
> It indeed looks like triggered by this patch. But I haven't figured
> out why it happens. v1 seems ok for the same tests. 

I think I finally figured out this (supposedly simple) bug. If I got it 
right, we need:

diff --git c/kernel/bpf/hashtab.c w/kernel/bpf/hashtab.c
index fd8d4b0addfc..1cfa2329a53a 100644
--- c/kernel/bpf/hashtab.c
+++ w/kernel/bpf/hashtab.c
@@ -160,6 +160,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
                __this_cpu_dec(*(htab->map_locked[hash]));
                local_irq_restore(flags);
                preempt_enable();
+               *pflags = flags;
                return -EBUSY;
        }


Running CI tests here:

https://github.com/kernel-patches/bpf/pull/5769

If it works, I will send v3. 

Thanks,
Song

PS: s390x CI is running slowly. I had some jobs stuck in the queue 
for more than an hour. 

> 
> Song
> 
>> 
>> Ilya,
>> 
>> do you have an idea what is going on?
>
Song Liu Oct. 5, 2023, 1:50 a.m. UTC | #5
> On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com> wrote:
> 
> 
> 
>> On Oct 4, 2023, at 9:18 AM, Song Liu <songliubraving@meta.com> wrote:
>> 
>> 
>> 
>>> On Oct 3, 2023, at 8:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> 
>>> On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> 
>>>> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>>>>> 
>>>>> htab_lock_bucket uses the following logic to avoid recursion:
>>>>> 
>>>>> 1. preempt_disable();
>>>>> 2. check percpu counter htab->map_locked[hash] for recursion;
>>>>> 2.1. if map_lock[hash] is already taken, return -BUSY;
>>>>> 3. raw_spin_lock_irqsave();
>>>>> 
>>>>> However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
>>>>> logic will not able to access the same hash of the hashtab and get -EBUSY.
>>>>> This -EBUSY is not really necessary. Fix it by disabling IRQ before
>>>>> checking map_locked:
>>>>> 
>>>>> 1. preempt_disable();
>>>>> 2. local_irq_save();
>>>>> 3. check percpu counter htab->map_locked[hash] for recursion;
>>>>> 3.1. if map_lock[hash] is already taken, return -BUSY;
>>>>> 4. raw_spin_lock().
>>>>> 
>>>>> Similarly, use raw_spin_unlock() and local_irq_restore() in
>>>>> htab_unlock_bucket().
>>>>> 
>>>>> Suggested-by: Tejun Heo <tj@kernel.org>
>>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>>> 
>>>>> ---
>>>>> Changes in v2:
>>>>> 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
>>>>> (Andrii)
>>>>> ---
>>>>> kernel/bpf/hashtab.c | 7 +++++--
>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>> 
>>>> 
>>>> Now it's more symmetrical and seems correct to me, thanks!
>>>> 
>>>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>>>> 
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index a8c7e1c5abfa..fd8d4b0addfc 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>>      hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>>> 
>>>>>      preempt_disable();
>>>>> +       local_irq_save(flags);
>>>>>      if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>>              __this_cpu_dec(*(htab->map_locked[hash]));
>>>>> +               local_irq_restore(flags);
>>>>>              preempt_enable();
>>>>>              return -EBUSY;
>>>>>      }
>>>>> 
>>>>> -       raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>> +       raw_spin_lock(&b->raw_lock);
>>> 
>>> Song,
>>> 
>>> take a look at s390 crash in BPF CI.
>>> I suspect this patch is causing it.
>> 
>> It indeed looks like triggered by this patch. But I haven't figured
>> out why it happens. v1 seems ok for the same tests. 
> 
> I guess I finally figured out this (should be simple) bug. If I got it 
> correctly, we need:
> 
> diff --git c/kernel/bpf/hashtab.c w/kernel/bpf/hashtab.c
> index fd8d4b0addfc..1cfa2329a53a 100644
> --- c/kernel/bpf/hashtab.c
> +++ w/kernel/bpf/hashtab.c
> @@ -160,6 +160,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                __this_cpu_dec(*(htab->map_locked[hash]));
>                local_irq_restore(flags);
>                preempt_enable();
> +               *pflags = flags;
>                return -EBUSY;
>        }

No... I was totally wrong. This is not needed. 

Trying something different..

Thanks,
Song
Song Liu Oct. 6, 2023, 1:15 a.m. UTC | #6
> On Oct 4, 2023, at 6:50 PM, Song Liu <songliubraving@meta.com> wrote:
>> On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com> wrote:

[...]

>>>>>
>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>> index a8c7e1c5abfa..fd8d4b0addfc 100644
>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>>>     hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>>>>
>>>>>>     preempt_disable();
>>>>>> +       local_irq_save(flags);
>>>>>>     if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>>>             __this_cpu_dec(*(htab->map_locked[hash]));
>>>>>> +               local_irq_restore(flags);
>>>>>>             preempt_enable();
>>>>>>             return -EBUSY;
>>>>>>     }
>>>>>>
>>>>>> -       raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>>> +       raw_spin_lock(&b->raw_lock);
>>>>
>>>> Song,
>>>>
>>>> take a look at s390 crash in BPF CI.
>>>> I suspect this patch is causing it.
>>>
>>> It indeed looks like triggered by this patch. But I haven't figured
>>> out why it happens. v1 seems ok for the same tests.

Update my findings today:

I tried to reproduce the issue locally with qemu on my server (x86_64).
I got the following artifacts:

1. bzImage and selftests from CI: (need to login to GitHub)
https://github.com/kernel-patches/bpf/suites/16885416280/artifacts/964765766

2. cross compiler:
https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/13.2.0/x86_64-gcc-13.2.0-nolibc-s390-linux.tar.gz

3. root image:
https://libbpf-ci.s3.us-west-1.amazonaws.com/libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst

With bzImage compiled in CI, I can reproduce the issue with qemu.
However, if I compile the kernel locally with the cross compiler
(with .config from CI, and then olddefconfig), the issue cannot
be reproduced. I have attached the two .config files here. They
look very similar, except the compiler version:

-CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"
+CONFIG_CC_VERSION_TEXT="s390-linux-gcc (GCC) 13.2.0"

So far, I still think v2 is the right patch. But I really cannot
explain the issue on the bzImage from the CI. I cannot make much
sense out of the s390 assembly code either. (TIL: gdb on my x86
server can disassemble s390 binaries.)

Ilya,

Could you please take a look at this?

Thanks,
Song

PS: the root image from the CI is not easy to use. Hopefully you
have something better than that.
Ilya Leoshkevich Oct. 9, 2023, 2:58 p.m. UTC | #7
On Fri, 2023-10-06 at 01:15 +0000, Song Liu wrote:
> > On Oct 4, 2023, at 6:50 PM, Song Liu <songliubraving@meta.com>
> > wrote:
> > > On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com>
> > > wrote:
> 
> [...]
> 
> > > > > > 
> > > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > > > index a8c7e1c5abfa..fd8d4b0addfc 100644
> > > > > > > --- a/kernel/bpf/hashtab.c
> > > > > > > +++ b/kernel/bpf/hashtab.c
> > > > > > > @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > > > > > >      hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> > > > > > > 
> > > > > > >      preempt_disable();
> > > > > > > +       local_irq_save(flags);
> > > > > > >      if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > > > > > >              __this_cpu_dec(*(htab->map_locked[hash]));
> > > > > > > +               local_irq_restore(flags);
> > > > > > >              preempt_enable();
> > > > > > >              return -EBUSY;
> > > > > > >      }
> > > > > > > 
> > > > > > > -       raw_spin_lock_irqsave(&b->raw_lock, flags);
> > > > > > > +       raw_spin_lock(&b->raw_lock);
> > > > > 
> > > > > Song,
> > > > > 
> > > > > take a look at s390 crash in BPF CI.
> > > > > I suspect this patch is causing it.
> > > > 
> > > > It indeed looks like triggered by this patch. But I haven't
> > > > figured
> > > > out why it happens. v1 seems ok for the same tests. 
> 
> Update my findings today:
> 
> I tried to reproduce the issue locally with qemu on my server
> (x86_64). 
> I got the following artifacts:
> 
> 1. bzImage and selftests from CI: (need to login to GitHub)
> https://github.com/kernel-patches/bpf/suites/16885416280/artifacts/964765766
> 
> 2. cross compiler:
> https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/13.2.0/x86_64-gcc-13.2.0-nolibc-s390-linux.tar.gz
> 
> 3. root image:
> https://libbpf-ci.s3.us-west-1.amazonaws.com/libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
> 
> With bzImage compiled in CI, I can reproduce the issue with qemu. 
> However, if I compile the kernel locally with the cross compiler
> (with .config from CI, and then olddefconfig), the issue cannot 
> be reproduced. I have attached the two .config files here. They 
> look very similar, except the compiler version:
> 
> -CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"
> +CONFIG_CC_VERSION_TEXT="s390-linux-gcc (GCC) 13.2.0"
> 
> So far, I still think v2 is the right patch. But I really cannot
> explain the issue on the bzImage from the CI. I cannot make much 
> sense out of the s390 assembly code either. (TIL: gdb on my x86
> server can disassem s390 binary). 
> 
> Ilya, 
> 
> Could you please take a look at this?
> 
> Thanks,
> Song
> 
> PS: the root image from the CI is not easy to use. Hopefully you
> have something better than that.

Hi,

sorry for the long silence, I was traveling and then sick.
I'll look into this ASAP.

Best regards,
Ilya
Ilya Leoshkevich Oct. 10, 2023, 8:48 p.m. UTC | #8
On Fri, 2023-10-06 at 01:15 +0000, Song Liu wrote:
> > On Oct 4, 2023, at 6:50 PM, Song Liu <songliubraving@meta.com>
> > wrote:
> > > On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com>
> > > wrote:
> 
> [...]
> 
> > > > > > 
> > > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > > > index a8c7e1c5abfa..fd8d4b0addfc 100644
> > > > > > > --- a/kernel/bpf/hashtab.c
> > > > > > > +++ b/kernel/bpf/hashtab.c
> > > > > > > @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > > > > > >      hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> > > > > > > 
> > > > > > >      preempt_disable();
> > > > > > > +       local_irq_save(flags);
> > > > > > >      if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > > > > > >              __this_cpu_dec(*(htab->map_locked[hash]));
> > > > > > > +               local_irq_restore(flags);
> > > > > > >              preempt_enable();
> > > > > > >              return -EBUSY;
> > > > > > >      }
> > > > > > > 
> > > > > > > -       raw_spin_lock_irqsave(&b->raw_lock, flags);
> > > > > > > +       raw_spin_lock(&b->raw_lock);
> > > > > 
> > > > > Song,
> > > > > 
> > > > > take a look at s390 crash in BPF CI.
> > > > > I suspect this patch is causing it.
> > > > 
> > > > It indeed looks like triggered by this patch. But I haven't
> > > > figured
> > > > out why it happens. v1 seems ok for the same tests. 
> 
> Update my findings today:
> 
> I tried to reproduce the issue locally with qemu on my server
> (x86_64). 
> I got the following artifacts:
> 
> 1. bzImage and selftests from CI: (need to login to GitHub)
> https://github.com/kernel-patches/bpf/suites/16885416280/artifacts/964765766
> 
> 2. cross compiler:
> https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/13.2.0/x86_64-gcc-13.2.0-nolibc-s390-linux.tar.gz
> 
> 3. root image:
> https://libbpf-ci.s3.us-west-1.amazonaws.com/libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
> 
> With bzImage compiled in CI, I can reproduce the issue with qemu. 
> However, if I compile the kernel locally with the cross compiler
> (with .config from CI, and then olddefconfig), the issue cannot 
> be reproduced. I have attached the two .config files here. They 
> look very similar, except the compiler version:
> 
> -CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"
> +CONFIG_CC_VERSION_TEXT="s390-linux-gcc (GCC) 13.2.0"
> 
> So far, I still think v2 is the right patch. But I really cannot
> explain the issue on the bzImage from the CI. I cannot make much 
> sense out of the s390 assembly code either. (TIL: gdb on my x86
> server can disassem s390 binary). 
> 
> Ilya, 
> 
> Could you please take a look at this?
> 
> Thanks,
> Song
> 
> PS: the root image from the CI is not easy to use. Hopefully you
> have something better than that.

Hi,

Thanks for posting the analysis and links to the artifacts, that saved
me quite some time. The crash is caused by a backchain issue in the
trampoline code and has nothing to do with your patch; I've posted the
fix here [1].

The difference between compilers is most likely caused by different
inlining decisions around lookup_elem_raw(). When it's inlined, the
test becomes a no-op.

Regarding GDB, Debian and Ubuntu have gdb-multiarch package. On Fedora
one has to build it from source; the magic binutils-gdb configure flag
is --enable-targets=all.

Regarding the development environment, in this case I've sidestepped
the need for a proper image by putting everything into initramfs:

workdir=$(mktemp -d)
tar -C "$workdir" -xf libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
rsync -a selftests "$workdir"
(cd "$workdir" && find . | cpio -o --format=newc -R root:root) >initrd.cpio
qemu-system-s390x -accel kvm -smp 2 -m 4G \
    -kernel kbuild-output/arch/s390/boot/bzImage -nographic \
    -append 'nokaslr console=ttyS1 rdinit=/bin/sh' \
    -initrd initrd.cpio -s

For the regular development I have a different setup, with a
non-minimal Debian install inside the guest, and the testcases mounted
from the host using 9p.

Best regards,
Ilya

[1]
https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
Song Liu Oct. 10, 2023, 10:13 p.m. UTC | #9
Hi Ilya,

On Tue, Oct 10, 2023 at 1:49 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
[...]
> >
> > Thanks,
> > Song
> >
> > PS: the root image from the CI is not easy to use. Hopefully you
> > have something better than that.
>
> Hi,
>
> Thanks for posting the analysis and links to the artifacts, that saved
> me quite some time. The crash is caused by a backchain issue in the
> trampoline code and has nothing to do with your patch; I've posted the
> fix here [1].

Thanks for the fix!

Song

>
> The difference between compilers is most likely caused by different
> inlining decisions around lookup_elem_raw(). When it's inlined, the
> test becomes a no-op.
>
> Regarding GDB, Debian and Ubuntu have gdb-multiarch package. On Fedora
> one has to build it from source; the magic binutils-gdb configure flag
> is --enable-targets=all.
>
> Regarding the development environment, in this case I've sidestepped
> the need for a proper image by putting everything into initramfs:
>
> workdir=$(mktemp -d)
> tar -C "$workdir" -xf libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
> rsync -a selftests "$workdir"
> (cd "$workdir" && find . | cpio -o --format=newc -R root:root) >initrd.cpio
> qemu-system-s390x -accel kvm -smp 2 -m 4G \
>     -kernel kbuild-output/arch/s390/boot/bzImage -nographic \
>     -append 'nokaslr console=ttyS1 rdinit=/bin/sh' \
>     -initrd initrd.cpio -s

Nice trick!

> For the regular development I have a different setup, with a
> non-minimal Debian install inside the guest, and the testcases mounted
> from the host using 9p.
>
> Best regards,
> Ilya
>
> [1]
> https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
Andrii Nakryiko Oct. 12, 2023, 12:42 a.m. UTC | #10
On Tue, Oct 10, 2023 at 3:13 PM Song Liu <song@kernel.org> wrote:
>
> Hi Ilya,
>
> On Tue, Oct 10, 2023 at 1:49 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
> >
> [...]
> > >
> > > Thanks,
> > > Song
> > >
> > > PS: the root image from the CI is not easy to use. Hopefully you
> > > have something better than that.
> >
> > Hi,
> >
> > Thanks for posting the analysis and links to the artifacts, that saved
> > me quite some time. The crash is caused by a backchain issue in the
> > trampoline code and has nothing to do with your patch; I've posted the
> > fix here [1].
>
> Thanks for the fix!
>
> Song

Song, can you please resend your patch so that CI can test it again on
top of Ilya's changes? Thanks!

>
> >
> > The difference between compilers is most likely caused by different
> > inlining decisions around lookup_elem_raw(). When it's inlined, the
> > test becomes a no-op.
> >
> > Regarding GDB, Debian and Ubuntu have gdb-multiarch package. On Fedora
> > one has to build it from source; the magic binutils-gdb configure flag
> > is --enable-targets=all.
> >
> > Regarding the development environment, in this case I've sidestepped
> > the need for a proper image by putting everything into initramfs:
> >
> > workdir=$(mktemp -d)
> > tar -C "$workdir" -xf libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
> > rsync -a selftests "$workdir"
> > (cd "$workdir" && find . | cpio -o --format=newc -R root:root) >initrd.cpio
> > qemu-system-s390x -accel kvm -smp 2 -m 4G \
> >     -kernel kbuild-output/arch/s390/boot/bzImage -nographic \
> >     -append 'nokaslr console=ttyS1 rdinit=/bin/sh' \
> >     -initrd initrd.cpio -s
>
> Nice trick!
>
> > For the regular development I have a different setup, with a
> > non-minimal Debian install inside the guest, and the testcases mounted
> > from the host using 9p.
> >
> > Best regards,
> > Ilya
> >
> > [1]
> > https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
Song Liu Oct. 12, 2023, 6:01 a.m. UTC | #11
> On Oct 11, 2023, at 5:42 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Oct 10, 2023 at 3:13 PM Song Liu <song@kernel.org> wrote:
>> 
>> Hi Ilya,
>> 
>> On Tue, Oct 10, 2023 at 1:49 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>>> 
>> [...]
>>>> 
>>>> Thanks,
>>>> Song
>>>> 
>>>> PS: the root image from the CI is not easy to use. Hopefully you
>>>> have something better than that.
>>> 
>>> Hi,
>>> 
>>> Thanks for posting the analysis and links to the artifacts, that saved
>>> me quite some time. The crash is caused by a backchain issue in the
>>> trampoline code and has nothing to do with your patch; I've posted the
>>> fix here [1].
>> 
>> Thanks for the fix!
>> 
>> Song
> 
> Song, can you please resend your patch so that CI can test it again on
> top of Ilya's changes? Thanks!

Sure. Just sent the patch to bpf tree. 

Thanks,
Song
diff mbox series

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a8c7e1c5abfa..fd8d4b0addfc 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -155,13 +155,15 @@  static inline int htab_lock_bucket(const struct bpf_htab *htab,
 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
 
 	preempt_disable();
+	local_irq_save(flags);
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
 		__this_cpu_dec(*(htab->map_locked[hash]));
+		local_irq_restore(flags);
 		preempt_enable();
 		return -EBUSY;
 	}
 
-	raw_spin_lock_irqsave(&b->raw_lock, flags);
+	raw_spin_lock(&b->raw_lock);
 	*pflags = flags;
 
 	return 0;
@@ -172,8 +174,9 @@  static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      unsigned long flags)
 {
 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
-	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
+	raw_spin_unlock(&b->raw_lock);
 	__this_cpu_dec(*(htab->map_locked[hash]));
+	local_irq_restore(flags);
 	preempt_enable();
 }