Message ID | 20241031084246.20737-1-zhongjinji@honor.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Series | bpf: smp_wmb before bpf_ringbuf really commit |
Hi,

On 10/31/2024 4:42 PM, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
>
> To guarantee visibility of writes to the ring buffer, it is necessary
> to call smp_wmb() before the ring buffer really commits.
> For instance, when CPU 1 updates the data in the buffer, the update is
> not yet visible to CPU 2, which may be accessing the buffer. This may
> lead to the consumer reading incorrect data. Using smp_wmb() before
> the commit guarantees that the consumer can read the correct data.
>
> CPU1:
> struct mem_event_t* data = bpf_ringbuf_reserve();
> data->type = MEM_EVENT_KSWAPD_WAKE;
> data->event_data.kswapd_wake.node_id = args->nid;
> bpf_ringbuf_commit(data);
>
> CPU2:
> cons_pos = smp_load_acquire(r->consumer_pos);
> len_ptr = r->data + (cons_pos & r->mask);
> sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
> access to sample

It seems you didn't use the ringbuf-related APIs (e.g.,
ring_buffer__consume()) in libbpf to access the ring buffer. In my
understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
works as the barrier that ensures the ordering of the committed data and
hdr->len, and accordingly the "smp_load_acquire(len_ptr)" in
ringbuf_process_ring() in libbpf works as the paired barrier that ensures
the ordering of hdr->len and the committed data. So I think the extra
smp_wmb() in the kernel is not necessary here. Instead, you should fix
your code in userspace.

>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  kernel/bpf/ringbuf.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index e1cfe890e0be..a66059e2b0d6 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
>  	rec_pos = (void *)hdr - (void *)rb->data;
>  	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
>
> +	/* Make sure the modification of data is visible on other CPU's
> +	 * before consume the event
> +	 */
> +	smp_wmb();
>  	if (flags & BPF_RB_FORCE_WAKEUP)
>  		irq_work_queue(&rb->work);
>  	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
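For reference, the consumer-side pairing described above can be sketched in
plain C (a simplified illustration, not the literal libbpf source; the helper
and macro names are made up for the example, and the __atomic_load_n()
builtin stands in for smp_load_acquire()):

/* Minimal sketch of the consumer-side pattern referred to above: read the
 * record length word with an acquire load so that, once the BUSY bit is
 * observed clear, all of the producer's writes to the sample data are
 * guaranteed to be visible.
 */
#include <stdint.h>
#include <stddef.h>

#define RINGBUF_BUSY_BIT	(1U << 31)	/* follows the kernel UAPI value */
#define RINGBUF_HDR_SZ		8

/* Pairs with the fully ordered xchg(&hdr->len, new_len) on the kernel side. */
static inline uint32_t load_acquire_u32(const uint32_t *p)
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* Return the next committed sample, or NULL if the producer has not
 * finished committing it yet.
 */
static void *ringbuf_peek(void *data, unsigned long mask, unsigned long cons_pos)
{
	uint32_t *len_ptr = (uint32_t *)((char *)data + (cons_pos & mask));
	uint32_t len = load_acquire_u32(len_ptr);

	if (len & RINGBUF_BUSY_BIT)
		return NULL;	/* record still being written */
	return (char *)len_ptr + RINGBUF_HDR_SZ;
}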
> It seems you didn't use the ringbuf related APIs (e.g.,
> ring_buffer__consume()) in libbpf to access the ring buffer. In my
> understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
> works as the barrier to ensure the order of committed data and hdr->len,
> and accordingly the "smp_load_acquire(len_ptr)" in
> ringbuf_process_ring() in libbpf works as the paired barrier to ensure
> the order of hdr->len and the committed data. So I think the extra
> smp_wmb() in the kernel is not necessary here. Instead, you should fix
> your code in the userspace.
>>
>> Signed-off-by: zhongjinji <zhongjinji@honor.com>
>> ---
>>  kernel/bpf/ringbuf.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
>> index e1cfe890e0be..a66059e2b0d6 100644
>> --- a/kernel/bpf/ringbuf.c
>> +++ b/kernel/bpf/ringbuf.c
>> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
>>  	rec_pos = (void *)hdr - (void *)rb->data;
>>  	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
>>
>> +	/* Make sure the modification of data is visible on other CPU's
>> +	 * before consume the event
>> +	 */
>> +	smp_wmb();
>>  	if (flags & BPF_RB_FORCE_WAKEUP)
>>  		irq_work_queue(&rb->work);
>>  	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))

Well, I missed the implementation of __smp_store_release on the x86
architecture. The extra barrier is not necessary there because
__smp_store_release has a memory barrier on x86, but it can't guarantee
the ordering of the committed data on arm64, because __smp_store_release
there does not include a memory barrier; it just ensures the variable
itself is visible. The arm64 implementation is as follows, in
common/arch/arm64/include/asm/barrier.h:

#define __smp_load_acquire(p)						\
({									\
	union { __unqual_scalar_typeof(*p) __val; char __c[1]; } __u;	\
	typeof(p) __p = (p);						\
	compiletime_assert_atomic_type(*p);				\
	kasan_check_read(__p, sizeof(*p));				\
	switch (sizeof(*p)) {						\
	case 1:								\
		asm volatile ("ldarb %w0, %1"				\
			: "=r" (*(__u8 *)__u.__c)			\
			: "Q" (*__p) : "memory");			\
		break;							\
	case 2:								\
		asm volatile ("ldarh %w0, %1"				\
			: "=r" (*(__u16 *)__u.__c)			\
			: "Q" (*__p) : "memory");			\
		break;							\
	case 4:								\
		asm volatile ("ldar %w0, %1"				\
			: "=r" (*(__u32 *)__u.__c)			\
			: "Q" (*__p) : "memory");			\
		break;							\
	case 8:								\
		asm volatile ("ldar %0, %1"				\
			: "=r" (*(__u64 *)__u.__c)			\
			: "Q" (*__p) : "memory");			\
		break;							\
	}								\
	(typeof(*p))__u.__val;						\
})
On Thu, Oct 31, 2024 at 9:19 AM zhongjinji <zhongjinji@honor.com> wrote:
>
> > It seems you didn't use the ringbuf related APIs (e.g.,
> > ring_buffer__consume()) in libbpf to access the ring buffer. In my
> > understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
> > works as the barrier to ensure the order of committed data and hdr->len,
> > and accordingly the "smp_load_acquire(len_ptr)" in
> > ringbuf_process_ring() in libbpf works as the paired barrier to ensure
> > the order of hdr->len and the committed data. So I think the extra
> > smp_wmb() in the kernel is not necessary here. Instead, you should fix
> > your code in the userspace.
> >>
> >> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> >> ---
> >>  kernel/bpf/ringbuf.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> >> index e1cfe890e0be..a66059e2b0d6 100644
> >> --- a/kernel/bpf/ringbuf.c
> >> +++ b/kernel/bpf/ringbuf.c
> >> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
> >>  	rec_pos = (void *)hdr - (void *)rb->data;
> >>  	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> >>
> >> +	/* Make sure the modification of data is visible on other CPU's
> >> +	 * before consume the event
> >> +	 */
> >> +	smp_wmb();
> >>  	if (flags & BPF_RB_FORCE_WAKEUP)
> >>  		irq_work_queue(&rb->work);
> >>  	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
>
> well, I missed the implementation of __smp_store_release in the x86 architecture,
> it is not necessary because __smp_store_release has a memory barrier in x86,

yep, we do rely on xchg() and smp_load_acquire() pairing. I'm not
familiar with arm64 specifics, I didn't know that this assumption is
x86-64-specific. Cc'ed Paul who I believe suggested xchg() as a more
performant way of achieving this. That happened years ago so all of
us, I'm sure, lost a lot of context by now, but still, let's see.

> but it can't guarantee the order of committed data in arm64 because
> __smp_store_release does not include a memory barrier, and it just ensure
> the variable is visible. The implementation of ARM64 is as follows,
> in common/arch/arm64/include/asm/barrier.h
>
> [ arm64 __smp_load_acquire() definition quoted in the previous message
>   snipped ]
On Fri, Nov 01, 2024 at 12:12:58PM -0700, Andrii Nakryiko wrote:
> On Thu, Oct 31, 2024 at 9:19 AM zhongjinji <zhongjinji@honor.com> wrote:
> >
> > > It seems you didn't use the ringbuf related APIs (e.g.,
> > > ring_buffer__consume()) in libbpf to access the ring buffer. In my
> > > understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
> > > works as the barrier to ensure the order of committed data and hdr->len,
> > > and accordingly the "smp_load_acquire(len_ptr)" in
> > > ringbuf_process_ring() in libbpf works as the paired barrier to ensure
> > > the order of hdr->len and the committed data. So I think the extra
> > > smp_wmb() in the kernel is not necessary here. Instead, you should fix
> > > your code in the userspace.
> > >>
> > >> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > >> ---
> > >>  kernel/bpf/ringbuf.c | 4 ++++
> > >>  1 file changed, 4 insertions(+)
> > >>
> > >> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > >> index e1cfe890e0be..a66059e2b0d6 100644
> > >> --- a/kernel/bpf/ringbuf.c
> > >> +++ b/kernel/bpf/ringbuf.c
> > >> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
> > >>  	rec_pos = (void *)hdr - (void *)rb->data;
> > >>  	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> > >>
> > >> +	/* Make sure the modification of data is visible on other CPU's
> > >> +	 * before consume the event
> > >> +	 */
> > >> +	smp_wmb();

This won't order the prior smp_load_acquire(), but you knew that already.

> > >>  	if (flags & BPF_RB_FORCE_WAKEUP)
> > >>  		irq_work_queue(&rb->work);
> > >>  	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
> >
> > well, I missed the implementation of __smp_store_release in the x86 architecture,
> > it is not necessary because __smp_store_release has a memory barrier in x86,
>
> yep, we do rely on xchg() and smp_load_acquire() pairing. I'm not
> familiar with arm64 specifics, I didn't know that this assumption is
> x86-64-specific. Cc'ed Paul who I believe suggested xchg() as a more
> performant way of achieving this. That happened years ago so all of
> us, I'm sure, lost a lot of context by now, but still, let's see.

The xchg() is supposed to be fully ordered on all architectures,
including arm64. And smp_load_acquire() provides acquire ordering,
again on all architectures. So if you have something like this
with x, y, and z all initially zero:

	CPU 0				CPU 1

	WRITE_ONCE(x, 1);		r3 = smp_load_acquire(&z);
	r1 = READ_ONCE(y);		r4 = READ_ONCE(x);
	r2 = xchg(&z, 1);		WRITE_ONCE(y, 1);

Then if r3==1, it is guaranteed that r1==0 and r4==1.

Or are you looking for some other ordering guarantee, and if so, what
is it?

							Thanx, Paul

> > but it can't guarantee the order of committed data in arm64 because
> > __smp_store_release does not include a memory barrier, and it just ensure
> > the variable is visible. The implementation of ARM64 is as follows,
> > in common/arch/arm64/include/asm/barrier.h
> >
> > [ arm64 __smp_load_acquire() definition snipped ]
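For reference, Paul's example can also be written as a Linux-kernel memory
model litmus test. The sketch below (added for illustration, with a made-up
test name; it is not part of the original thread) follows the herd7 C-litmus
format used under tools/memory-model/ and encodes the forbidden outcomes in
the exists clause, so the expected result is "Never":

C ringbuf-xchg-acquire
(*
 * P0 is the producer: write the data, then "commit" with a fully
 * ordered xchg().  P1 is the consumer: acquire-load the commit word,
 * then read the data.  The exists clause asks for the outcomes Paul
 * rules out (r3==1 together with r1==1 or r4==0).
 * Expected result: Never.
 *)

{}

P0(int *x, int *y, int *z)
{
	int r1;
	int r2;

	WRITE_ONCE(*x, 1);
	r1 = READ_ONCE(*y);
	r2 = xchg(z, 1);
}

P1(int *x, int *y, int *z)
{
	int r3;
	int r4;

	r3 = smp_load_acquire(z);
	r4 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
}

exists (1:r3=1 /\ (0:r1=1 \/ 1:r4=0))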
On Fri, Nov 1, 2024 at 1:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 12:12:58PM -0700, Andrii Nakryiko wrote:
> > On Thu, Oct 31, 2024 at 9:19 AM zhongjinji <zhongjinji@honor.com> wrote:
> > >
> > > > It seems you didn't use the ringbuf related APIs (e.g.,
> > > > ring_buffer__consume()) in libbpf to access the ring buffer. In my
> > > > understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
> > > > works as the barrier to ensure the order of committed data and hdr->len,
> > > > and accordingly the "smp_load_acquire(len_ptr)" in
> > > > ringbuf_process_ring() in libbpf works as the paired barrier to ensure
> > > > the order of hdr->len and the committed data. So I think the extra
> > > > smp_wmb() in the kernel is not necessary here. Instead, you should fix
> > > > your code in the userspace.
> > > >>
> > > >> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > > >> ---
> > > >>  kernel/bpf/ringbuf.c | 4 ++++
> > > >>  1 file changed, 4 insertions(+)
> > > >>
> > > >> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > > >> index e1cfe890e0be..a66059e2b0d6 100644
> > > >> --- a/kernel/bpf/ringbuf.c
> > > >> +++ b/kernel/bpf/ringbuf.c
> > > >> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
> > > >>  	rec_pos = (void *)hdr - (void *)rb->data;
> > > >>  	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> > > >>
> > > >> +	/* Make sure the modification of data is visible on other CPU's
> > > >> +	 * before consume the event
> > > >> +	 */
> > > >> +	smp_wmb();
>
> This won't order the prior smp_load_acquire(), but you knew that already.
>
> > > >>  	if (flags & BPF_RB_FORCE_WAKEUP)
> > > >>  		irq_work_queue(&rb->work);
> > > >>  	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
> > >
> > > well, I missed the implementation of __smp_store_release in the x86 architecture,
> > > it is not necessary because __smp_store_release has a memory barrier in x86,
> >
> > yep, we do rely on xchg() and smp_load_acquire() pairing. I'm not
> > familiar with arm64 specifics, I didn't know that this assumption is
> > x86-64-specific. Cc'ed Paul who I believe suggested xchg() as a more
> > performant way of achieving this. That happened years ago so all of
> > us, I'm sure, lost a lot of context by now, but still, let's see.
>
> The xchg() is supposed to be fully ordered on all architectures,
> including arm64. And smp_load_acquire() provides acquire ordering,
> again on all architectures. So if you have something like this

I think this is what we need and should be sufficient. Thank you,
Paul, for confirming! Seems like we don't really need to add anything
extra here.

> with x, y, and z all initially zero:
>
> 	CPU 0				CPU 1
>
> 	WRITE_ONCE(x, 1);		r3 = smp_load_acquire(&z);
> 	r1 = READ_ONCE(y);		r4 = READ_ONCE(x);
> 	r2 = xchg(&z, 1);		WRITE_ONCE(y, 1);
>
> Then if r3==1, it is guaranteed that r1==0 and r4==1.
>
> Or are you looking for some other ordering guarantee, and if so, what
> is it?
>
> 							Thanx, Paul
>
> > > but it can't guarantee the order of committed data in arm64 because
> > > __smp_store_release does not include a memory barrier, and it just ensure
> > > the variable is visible. The implementation of ARM64 is as follows,
> > > in common/arch/arm64/include/asm/barrier.h
> > >
> > > [ arm64 __smp_load_acquire() definition snipped ]
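To make the thread's conclusion concrete, the pairing can also be modelled in
standalone userspace C11 (an illustrative sketch, not code from the thread or
from the kernel: atomic_exchange() with its default sequentially consistent
ordering stands in for the kernel's xchg() commit, and an acquire load stands
in for smp_load_acquire() on the consumer side):

/* Self-contained model of the producer/consumer pairing discussed above.
 * Build with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define BUSY_BIT (1u << 31)

static int payload;				/* the "sample data" */
static _Atomic unsigned int len = BUSY_BIT;	/* the "record header" */

static void *producer(void *arg)
{
	payload = 42;			/* plain write of the data */
	/* Fully ordered RMW "commit": publishes the length and clears BUSY. */
	atomic_exchange(&len, 16);
	return NULL;
}

static void *consumer(void *arg)
{
	unsigned int l;

	/* Acquire load pairs with the producer's exchange. */
	while ((l = atomic_load_explicit(&len, memory_order_acquire)) & BUSY_BIT)
		;			/* record not committed yet */
	/* Here the payload write is guaranteed to be visible. */
	printf("len=%u payload=%d\n", l, payload);
	return NULL;
}

int main(void)
{
	pthread_t p, c;

	pthread_create(&c, NULL, consumer, NULL);
	pthread_create(&p, NULL, producer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}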
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index e1cfe890e0be..a66059e2b0d6 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
 	rec_pos = (void *)hdr - (void *)rb->data;
 	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
 
+	/* Make sure the modification of data is visible on other CPU's
+	 * before consume the event
+	 */
+	smp_wmb();
 	if (flags & BPF_RB_FORCE_WAKEUP)
 		irq_work_queue(&rb->work);
 	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))