
[bpf-next] libbpf: fix ringbuf atomics, use EPOLLET

Message ID: 20230719013552.3476856-1-awerner32@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: [bpf-next] libbpf: fix ringbuf atomics, use EPOLLET

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on s390x with gcc
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers fail 11 maintainers not CCed: daniel@iogearbox.net yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com song@kernel.org sdf@google.com andrii@kernel.org jolsa@kernel.org haoluo@google.com ast@kernel.org
netdev/build_clang success Errors and warnings before: 9 this patch: 9
netdev/verify_signedoff fail author Signed-off-by missing
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 9 this patch: 9
netdev/checkpatch warning CHECK: spaces preferred around that '|' (ctx:VxV) WARNING: line length of 86 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16

Commit Message

Andrew Werner July 19, 2023, 1:35 a.m. UTC
This patch fixes the synchronization between libbpf and the ring
buffer producer in the kernel so that notifications cannot be lost
when the producer reads a stale view of the consumer position while
the consumer also reads a stale view of either the producer position
or the record header.

The problem before this change was that nothing enforced a
happens-before relationship between either side's write and the
other side's subsequent read.

The use of a sequentially consistent write ensures that the write
to the consumer position is either ordered before the producer
clears the busy bit, in which case the producer will see that the
consumer caught up and send a notification, or the write occurs
after the producer has cleared the busy bit, in which case the new
message will be visible to the consumer.
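
To make the reasoning concrete, below is a minimal model of the
store/load pattern involved (a sketch with hypothetical names; the
kernel side is paraphrased, not quoted):

```
/* Minimal model of the notification handshake (hypothetical names;
 * not the actual kernel or libbpf sources).  Each side publishes its
 * own position and then inspects the other side's, Dekker-style.
 */
#include <stdbool.h>

extern unsigned long producer_pos;	/* advanced by the kernel producer */
extern unsigned long consumer_pos;	/* advanced by the libbpf consumer */

/* Kernel producer, after committing the record starting at rec_pos:
 * the record's busy bit is cleared with a full barrier before this
 * load, and a wakeup is sent only if the consumer had already caught
 * up to this record.
 */
static bool producer_needs_wakeup(unsigned long rec_pos)
{
	return __atomic_load_n(&consumer_pos, __ATOMIC_ACQUIRE) == rec_pos;
}

/* libbpf consumer, after handling records up to cons_pos: publish the
 * new position, then look again for data produced in the meantime.
 * With a RELEASE store, the store may be reordered after the load
 * below, so the consumer can miss the new record while the producer
 * still sees the old consumer_pos and skips the wakeup; the SEQ_CST
 * store closes that window.
 */
static bool consumer_sees_new_data(unsigned long cons_pos)
{
	__atomic_store_n(&consumer_pos, cons_pos, __ATOMIC_SEQ_CST);
	return __atomic_load_n(&producer_pos, __ATOMIC_ACQUIRE) > cons_pos;
}
```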

All of this is in service of using EPOLLET, which results in fewer
wakeups and generally less work. This is borne out in the benchmark
data below. Note that without the atomics change, the use of EPOLLET
does not work correctly, and the benchmarks and tests show it.
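
Nothing changes for libbpf users: with EPOLLET, ring_buffer__poll()
still drains each ring that signaled before waiting again, which is
what makes edge-triggered notifications safe to rely on. A typical
consumer loop (a sketch; the callback and map fd names below are
placeholders) looks like this:

```
#include <errno.h>
#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
	/* process one ring buffer record */
	return 0;
}

static int consume_loop(int ringbuf_map_fd)
{
	struct ring_buffer *rb;
	int err = 0;

	rb = ring_buffer__new(ringbuf_map_fd, handle_event, NULL, NULL);
	if (!rb)
		return -errno;

	for (;;) {
		/* Waits in epoll_wait(), then consumes every ring that
		 * signaled before returning.
		 */
		err = ring_buffer__poll(rb, 100 /* timeout, ms */);
		if (err == -EINTR)
			continue;
		if (err < 0)
			break;
	}

	ring_buffer__free(rb);
	return err;
}
```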

The raw benchmark results are below (I've omitted the irrelevant
ones for brevity). The benchmarks were run on a 32-thread AMD
Ryzen 9 7950X 16-Core Processor.

The summary of the results is that in almost all cases, the
benchmarks are substantially improved. The only case that seems
worse is "Ringbuf sampled, reserve+commit vs output", for the
"reserve" case. I suspect this is because the consumer side is more
expensive, and the sampled notifications mean there's not much time
spent interacting with epoll.

Credit for the discovery of the bug[1] and guidance on how to
fix it[2] belongs to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>.

New:
```
Single-producer, parallel producer
==================================
rb-libbpf            43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s)
output               68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s)
rb-libbpf nr_prod 2  23.722 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s)
rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s)
rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s)
rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s)
rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s)
rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s)
rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s)
rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s)
rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s)
```

Old:
```
Single-producer, parallel producer
==================================
rb-libbpf            20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s)
output               67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s)
rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s)
rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s)
rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s)
rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s)
rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s)
rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s)
```

[1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r
[2]: https://github.com/aya-rs/aya/pull/294#issuecomment-1634705909
---
 tools/lib/bpf/ringbuf.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Patch

diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
index 02199364db13..32a618b4c4d6 100644
--- a/tools/lib/bpf/ringbuf.c
+++ b/tools/lib/bpf/ringbuf.c
@@ -148,7 +148,7 @@  int ring_buffer__add(struct ring_buffer *rb, int map_fd,
 	e = &rb->events[rb->ring_cnt];
 	memset(e, 0, sizeof(*e));
 
-	e->events = EPOLLIN;
+	e->events = EPOLLIN|EPOLLET;
 	e->data.fd = rb->ring_cnt;
 	if (epoll_ctl(rb->epoll_fd, EPOLL_CTL_ADD, map_fd, e) < 0) {
 		err = -errno;
@@ -260,7 +260,19 @@  static int64_t ringbuf_process_ring(struct ring *r)
 				cnt++;
 			}
 
-			smp_store_release(r->consumer_pos, cons_pos);
+			/* This ordering is critical to ensure that an epoll
+			 * notification gets sent in the case where the next
+			 * iteration of this loop discovers that the consumer is
+			 * caught up. If this store were performed using
+			 * RELEASE, it'd be possible for the consumer to fail to
+			 * see an updated producer position and for that
+			 * producer to fail to see this write. By making this
+			 * write SEQ_CST, we know that either the newly produced
+			 * message will be visible to the consumer, or the
+			 * producer will discover that the consumer is caught
+			 * up, and will ensure a notification is sent.
+			 */
+			__atomic_store_n(r->consumer_pos, cons_pos, __ATOMIC_SEQ_CST);
 		}
 	} while (got_new_data);
 done: