Message ID | 20231004004350.533234-1-song@kernel.org (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | [v2,bpf-next] bpf: Avoid unnecessary -EBUSY from htab_lock_bucket
On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>
> htab_lock_bucket uses the following logic to avoid recursion:
>
> 1. preempt_disable();
> 2. check percpu counter htab->map_locked[hash] for recursion;
>    2.1. if map_locked[hash] is already taken, return -EBUSY;
> 3. raw_spin_lock_irqsave();
>
> However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
> logic will not be able to access the same hash of the hashtab and get
> -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQs
> before checking map_locked:
>
> 1. preempt_disable();
> 2. local_irq_save();
> 3. check percpu counter htab->map_locked[hash] for recursion;
>    3.1. if map_locked[hash] is already taken, return -EBUSY;
> 4. raw_spin_lock().
>
> Similarly, use raw_spin_unlock() and local_irq_restore() in
> htab_unlock_bucket().
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Song Liu <song@kernel.org>
>
> ---
> Changes in v2:
> 1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
>    (Andrii)
> ---
> kernel/bpf/hashtab.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)

Now it's more symmetrical and seems correct to me, thanks!

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index a8c7e1c5abfa..fd8d4b0addfc 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>
> 	preempt_disable();
> +	local_irq_save(flags);
> 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> 		__this_cpu_dec(*(htab->map_locked[hash]));
> +		local_irq_restore(flags);
> 		preempt_enable();
> 		return -EBUSY;
> 	}
>
> -	raw_spin_lock_irqsave(&b->raw_lock, flags);
> +	raw_spin_lock(&b->raw_lock);
> 	*pflags = flags;
>
> 	return 0;
> @@ -172,8 +174,9 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> 				      unsigned long flags)
> {
> 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> -	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> +	raw_spin_unlock(&b->raw_lock);
> 	__this_cpu_dec(*(htab->map_locked[hash]));
> +	local_irq_restore(flags);
> 	preempt_enable();
> }
>
> --
> 2.34.1
On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
> >
> > htab_lock_bucket uses the following logic to avoid recursion:

[...]

> > 	preempt_disable();
> > +	local_irq_save(flags);
> > 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > 		__this_cpu_dec(*(htab->map_locked[hash]));
> > +		local_irq_restore(flags);
> > 		preempt_enable();
> > 		return -EBUSY;
> > 	}
> >
> > -	raw_spin_lock_irqsave(&b->raw_lock, flags);
> > +	raw_spin_lock(&b->raw_lock);

Song,

take a look at s390 crash in BPF CI.
I suspect this patch is causing it.

Ilya,

do you have an idea what is going on?
> On Oct 3, 2023, at 8:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 8:08 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>> On Tue, Oct 3, 2023 at 5:45 PM Song Liu <song@kernel.org> wrote:
>>>
>>> htab_lock_bucket uses the following logic to avoid recursion:

[...]

> Song,
>
> take a look at s390 crash in BPF CI.
> I suspect this patch is causing it.

It indeed looks like it is triggered by this patch. But I haven't figured
out why it happens. v1 seems OK for the same tests.

Song

> Ilya,
>
> do you have an idea what is going on?
> On Oct 4, 2023, at 9:18 AM, Song Liu <songliubraving@meta.com> wrote:
>
>> On Oct 3, 2023, at 8:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

[...]

>> take a look at s390 crash in BPF CI.
>> I suspect this patch is causing it.
>
> It indeed looks like it is triggered by this patch. But I haven't figured
> out why it happens. v1 seems OK for the same tests.

I guess I finally figured out this (should-be-simple) bug. If I got it
correctly, we need:

diff --git c/kernel/bpf/hashtab.c w/kernel/bpf/hashtab.c
index fd8d4b0addfc..1cfa2329a53a 100644
--- c/kernel/bpf/hashtab.c
+++ w/kernel/bpf/hashtab.c
@@ -160,6 +160,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 		__this_cpu_dec(*(htab->map_locked[hash]));
 		local_irq_restore(flags);
 		preempt_enable();
+		*pflags = flags;
 		return -EBUSY;
 	}

Running CI tests here: https://github.com/kernel-patches/bpf/pull/5769

If it works, I will send v3.

Thanks,
Song

PS: The s390x CI is running slow. I got some jobs stuck in the queue for
more than an hour.
> On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com> wrote:

[...]

> I guess I finally figured out this (should-be-simple) bug. If I got it
> correctly, we need:
>
> diff --git c/kernel/bpf/hashtab.c w/kernel/bpf/hashtab.c
> index fd8d4b0addfc..1cfa2329a53a 100644
> --- c/kernel/bpf/hashtab.c
> +++ w/kernel/bpf/hashtab.c
> @@ -160,6 +160,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> 		__this_cpu_dec(*(htab->map_locked[hash]));
> 		local_irq_restore(flags);
> 		preempt_enable();
> +		*pflags = flags;
> 		return -EBUSY;
> 	}

No... I was totally wrong. This is not needed. Trying something
different...

Thanks,
Song
> On Oct 4, 2023, at 6:50 PM, Song Liu <songliubraving@meta.com> wrote:
>> On Oct 4, 2023, at 5:11 PM, Song Liu <songliubraving@meta.com> wrote:

[...]

>>>> take a look at s390 crash in BPF CI.
>>>> I suspect this patch is causing it.
>>>
>>> It indeed looks like it is triggered by this patch. But I haven't
>>> figured out why it happens. v1 seems OK for the same tests.

An update on my findings today:

I tried to reproduce the issue locally with qemu on my server (x86_64).
I got the following artifacts:

1. bzImage and selftests from CI (need to log in to GitHub):
   https://github.com/kernel-patches/bpf/suites/16885416280/artifacts/964765766

2. cross compiler:
   https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/13.2.0/x86_64-gcc-13.2.0-nolibc-s390-linux.tar.gz

3. root image:
   https://libbpf-ci.s3.us-west-1.amazonaws.com/libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst

With the bzImage compiled in CI, I can reproduce the issue with qemu.
However, if I compile the kernel locally with the cross compiler (with
the .config from CI, then olddefconfig), the issue cannot be reproduced.
I have attached the two .config files here. They look very similar,
except for the compiler version:

-CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"
+CONFIG_CC_VERSION_TEXT="s390-linux-gcc (GCC) 13.2.0"

So far, I still think v2 is the right patch. But I really cannot explain
the issue with the bzImage from the CI. I cannot make much sense out of
the s390 assembly code either. (TIL: gdb on my x86 server can
disassemble s390 binaries.)

Ilya,

Could you please take a look at this?

Thanks,
Song

PS: The root image from the CI is not easy to use. Hopefully you have
something better than that.
On Fri, 2023-10-06 at 01:15 +0000, Song Liu wrote:

[...]

> Ilya,
>
> Could you please take a look at this?

Hi,

Sorry for the long silence; I was traveling and then sick. I'll look
into this ASAP.

Best regards,
Ilya
On Fri, 2023-10-06 at 01:15 +0000, Song Liu wrote:

[...]

> So far, I still think v2 is the right patch. But I really cannot
> explain the issue with the bzImage from the CI.
>
> Ilya,
>
> Could you please take a look at this?

Hi,

Thanks for posting the analysis and links to the artifacts; that saved
me quite some time. The crash is caused by a backchain issue in the
trampoline code and has nothing to do with your patch; I've posted the
fix here [1].

The difference between compilers is most likely caused by different
inlining decisions around lookup_elem_raw(). When it's inlined, the
test becomes a no-op.

Regarding GDB, Debian and Ubuntu have a gdb-multiarch package. On
Fedora one has to build it from source; the magic binutils-gdb
configure flag is --enable-targets=all.

Regarding the development environment, in this case I sidestepped the
need for a proper image by putting everything into the initramfs:

    workdir=$(mktemp -d)
    tar -C "$workdir" -xf libbpf-vmtest-rootfs-2022.10.23-bullseye-s390x.tar.zst
    rsync -a selftests "$workdir"
    (cd "$workdir" && find . | cpio -o --format=newc -R root:root) >initrd.cpio
    qemu-system-s390x -accel kvm -smp 2 -m 4G \
        -kernel kbuild-output/arch/s390/boot/bzImage -nographic \
        -append 'nokaslr console=ttyS1 rdinit=/bin/sh' \
        -initrd initrd.cpio -s

For regular development I have a different setup, with a non-minimal
Debian install inside the guest and the testcases mounted from the host
using 9p.

Best regards,
Ilya

[1] https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
Hi Ilya,

On Tue, Oct 10, 2023 at 1:49 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
[...]
>
> Thanks for posting the analysis and links to the artifacts; that saved
> me quite some time. The crash is caused by a backchain issue in the
> trampoline code and has nothing to do with your patch; I've posted the
> fix here [1].

Thanks for the fix!

> Regarding the development environment, in this case I sidestepped the
> need for a proper image by putting everything into the initramfs:

[...]

Nice trick!

> For regular development I have a different setup, with a non-minimal
> Debian install inside the guest and the testcases mounted from the
> host using 9p.

Song

> [1] https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
On Tue, Oct 10, 2023 at 3:13 PM Song Liu <song@kernel.org> wrote:
>
> Hi Ilya,
>
> On Tue, Oct 10, 2023 at 1:49 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
> >
[...]
> > Thanks for posting the analysis and links to the artifacts; that saved
> > me quite some time. The crash is caused by a backchain issue in the
> > trampoline code and has nothing to do with your patch; I've posted the
> > fix here [1].
>
> Thanks for the fix!

Song, can you please resend your patch so that CI can test it again on
top of Ilya's changes? Thanks!

[...]

> > [1] https://lore.kernel.org/bpf/20231010203512.385819-1-iii@linux.ibm.com/
> On Oct 11, 2023, at 5:42 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
[...]
>
> Song, can you please resend your patch so that CI can test it again on
> top of Ilya's changes? Thanks!

Sure. Just sent the patch to the bpf tree.

Thanks,
Song
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a8c7e1c5abfa..fd8d4b0addfc 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -155,13 +155,15 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
 
 	preempt_disable();
+	local_irq_save(flags);
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
 		__this_cpu_dec(*(htab->map_locked[hash]));
+		local_irq_restore(flags);
 		preempt_enable();
 		return -EBUSY;
 	}
 
-	raw_spin_lock_irqsave(&b->raw_lock, flags);
+	raw_spin_lock(&b->raw_lock);
 	*pflags = flags;
 
 	return 0;
@@ -172,8 +174,9 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      unsigned long flags)
 {
 	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
-	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
+	raw_spin_unlock(&b->raw_lock);
 	__this_cpu_dec(*(htab->map_locked[hash]));
+	local_irq_restore(flags);
 	preempt_enable();
 }
htab_lock_bucket uses the following logic to avoid recursion:

1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
   2.1. if map_locked[hash] is already taken, return -EBUSY;
3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not be able to access the same hash of the hashtab and get
-EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQs
before checking map_locked:

1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
   3.1. if map_locked[hash] is already taken, return -EBUSY;
4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>

---
Changes in v2:
1. Use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().
   (Andrii)
---
 kernel/bpf/hashtab.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)