Message ID: cover.1726480607.git.lorenzo@kernel.org (mailing list archive)
Series: Introduce GRO support to cpumap codebase
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon, 16 Sep 2024 12:13:42 +0200

> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> Changes in rfc v2:
> - get rid of dummy netdev dependency
>
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support

Oh okay, so it still uses a NAPI. When I'm back from the conferences (next
week), I might rebase and send the solution where I only use the GRO part
of it, i.e. no napi_schedule()/poll()/napi_complete() logic.

>
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)

Thanks,
Olek
Hi Lorenzo,

On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> Changes in rfc v2:
> - get rid of dummy netdev dependency
>
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support
>
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)
>
> --
> 2.46.0
>

Sorry about the long delay - finally caught up to everything after
conferences.

I re-ran my synthetic tests (including baseline). v2 is somehow showing
2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
variable I changed is kernel version - steering prog is active for both.


Baseline (again)

./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30          ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30

        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
Run 1   2560252       0.00009087       0.00010495       0.00011647       Run 1   15479.31
Run 2   2665517       0.00008575       0.00010239       0.00013311       Run 2   15162.48
Run 3   2755939       0.00008191       0.00010367       0.00012287       Run 3   14709.04
Run 4   2595680       0.00008575       0.00011263       0.00012671       Run 4   15373.06
Run 5   2841865       0.00007999       0.00009471       0.00012799       Run 5   15234.91
Average 2683850.6     0.000084854      0.00010367       0.00012543       Average 15191.76

cpumap NAPI patches v2

        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
Run 1   2577838       0.00008575       0.00012031       0.00013695       Run 1   19914.56
Run 2   2729237       0.00007551       0.00013311       0.00017663       Run 2   20140.92
Run 3   2689442       0.00008319       0.00010495       0.00013311       Run 3   19887.48
Run 4   2862366       0.00008127       0.00009471       0.00010623       Run 4   19374.49
Run 5   2700538       0.00008319       0.00010367       0.00012799       Run 5   19784.49
Average 2711884.2     0.000081782      0.00011135       0.000136182      Average 19820.388
Delta   1.04%         -3.62%           7.41%            8.57%                    30.47%

Thanks,
Daniel
> Hi Lorenzo, > > On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: > > Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a > > NAPI-kthread pinned on the selected cpu. > > > > Changes in rfc v2: > > - get rid of dummy netdev dependency > > > > Lorenzo Bianconi (3): > > net: Add napi_init_for_gro routine > > net: add napi_threaded_poll to netdevice.h > > bpf: cpumap: Add gro support > > > > include/linux/netdevice.h | 3 + > > kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- > > net/core/dev.c | 27 ++++++--- > > 3 files changed, 73 insertions(+), 80 deletions(-) > > > > -- > > 2.46.0 > > > > Sorry about the long delay - finally caught up to everything after > conferences. > > I re-ran my synthetic tests (including baseline). v2 is somehow showing > 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only > variable I changed is kernel version - steering prog is active for both. > > > Baseline (again) > > ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 > Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 > Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 > Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 > Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 > Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 > > cpumap NAPI patches v2 > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 > Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 > Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 > Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 > Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 > Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 > Delta 1.04% -3.62% 7.41% 8.57% 30.47% > > Thanks, > Daniel Hi Daniel, cool, thx for testing it. @Olek: how do we want to proceed on it? Are you still working on it or do you want me to send a regular patch for it? Regards, Lorenzo
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 12:46:00 +0200

>> Hi Lorenzo,
>>
>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
>>> NAPI-kthread pinned on the selected cpu.
>>>
>>> Changes in rfc v2:
>>> - get rid of dummy netdev dependency
>>>
>>> Lorenzo Bianconi (3):
>>>   net: Add napi_init_for_gro routine
>>>   net: add napi_threaded_poll to netdevice.h
>>>   bpf: cpumap: Add gro support
>>>
>>>  include/linux/netdevice.h |   3 +
>>>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>>>  net/core/dev.c            |  27 ++++++---
>>>  3 files changed, 73 insertions(+), 80 deletions(-)
>>>
>>> --
>>> 2.46.0
>>>
>>
>> Sorry about the long delay - finally caught up to everything after
>> conferences.
>>
>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
>> variable I changed is kernel version - steering prog is active for both.
>>
>>
>> Baseline (again)
>>
>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30          ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
>>
>>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
>> Run 1   2560252       0.00009087       0.00010495       0.00011647       Run 1   15479.31
>> Run 2   2665517       0.00008575       0.00010239       0.00013311       Run 2   15162.48
>> Run 3   2755939       0.00008191       0.00010367       0.00012287       Run 3   14709.04
>> Run 4   2595680       0.00008575       0.00011263       0.00012671       Run 4   15373.06
>> Run 5   2841865       0.00007999       0.00009471       0.00012799       Run 5   15234.91
>> Average 2683850.6     0.000084854      0.00010367       0.00012543       Average 15191.76
>>
>> cpumap NAPI patches v2
>>
>>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
>> Run 1   2577838       0.00008575       0.00012031       0.00013695       Run 1   19914.56
>> Run 2   2729237       0.00007551       0.00013311       0.00017663       Run 2   20140.92
>> Run 3   2689442       0.00008319       0.00010495       0.00013311       Run 3   19887.48
>> Run 4   2862366       0.00008127       0.00009471       0.00010623       Run 4   19374.49
>> Run 5   2700538       0.00008319       0.00010367       0.00012799       Run 5   19784.49
>> Average 2711884.2     0.000081782      0.00011135       0.000136182      Average 19820.388
>> Delta   1.04%         -3.62%           7.41%            8.57%                    30.47%
>>
>> Thanks,
>> Daniel
>
> Hi Daniel,
>
> cool, thx for testing it.
>
> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> to send a regular patch for it?

Hi,

I had a small vacation, sorry. I'm starting to work on it again today.

>
> Regards,
> Lorenzo

Thanks,
Olek
> From: Lorenzo Bianconi <lorenzo@kernel.org> > Date: Wed, 9 Oct 2024 12:46:00 +0200 > > >> Hi Lorenzo, > >> > >> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: > >>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a > >>> NAPI-kthread pinned on the selected cpu. > >>> > >>> Changes in rfc v2: > >>> - get rid of dummy netdev dependency > >>> > >>> Lorenzo Bianconi (3): > >>> net: Add napi_init_for_gro routine > >>> net: add napi_threaded_poll to netdevice.h > >>> bpf: cpumap: Add gro support > >>> > >>> include/linux/netdevice.h | 3 + > >>> kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- > >>> net/core/dev.c | 27 ++++++--- > >>> 3 files changed, 73 insertions(+), 80 deletions(-) > >>> > >>> -- > >>> 2.46.0 > >>> > >> > >> Sorry about the long delay - finally caught up to everything after > >> conferences. > >> > >> I re-ran my synthetic tests (including baseline). v2 is somehow showing > >> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only > >> variable I changed is kernel version - steering prog is active for both. > >> > >> > >> Baseline (again) > >> > >> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 > >> > >> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > >> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 > >> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 > >> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 > >> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 > >> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 > >> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 > >> > >> cpumap NAPI patches v2 > >> > >> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > >> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 > >> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 > >> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 > >> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 > >> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 > >> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 > >> Delta 1.04% -3.62% 7.41% 8.57% 30.47% > >> > >> Thanks, > >> Daniel > > > > Hi Daniel, > > > > cool, thx for testing it. > > > > @Olek: how do we want to proceed on it? Are you still working on it or do you want me > > to send a regular patch for it? > > Hi, > > I had a small vacation, sorry. I'm starting working on it again today. ack, no worries. Are you going to rebase the other patches on top of it or are you going to try a different approach? Regards, Lorenzo > > > > > Regards, > > Lorenzo > > Thanks, > Olek
From: Lorenzo Bianconi <lorenzo@kernel.org> Date: Wed, 9 Oct 2024 14:47:58 +0200 >> From: Lorenzo Bianconi <lorenzo@kernel.org> >> Date: Wed, 9 Oct 2024 12:46:00 +0200 >> >>>> Hi Lorenzo, >>>> >>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: >>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a >>>>> NAPI-kthread pinned on the selected cpu. >>>>> >>>>> Changes in rfc v2: >>>>> - get rid of dummy netdev dependency >>>>> >>>>> Lorenzo Bianconi (3): >>>>> net: Add napi_init_for_gro routine >>>>> net: add napi_threaded_poll to netdevice.h >>>>> bpf: cpumap: Add gro support >>>>> >>>>> include/linux/netdevice.h | 3 + >>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- >>>>> net/core/dev.c | 27 ++++++--- >>>>> 3 files changed, 73 insertions(+), 80 deletions(-) >>>>> >>>>> -- >>>>> 2.46.0 >>>>> >>>> >>>> Sorry about the long delay - finally caught up to everything after >>>> conferences. >>>> >>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing >>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only >>>> variable I changed is kernel version - steering prog is active for both. >>>> >>>> >>>> Baseline (again) >>>> >>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 >>>> >>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 >>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 >>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 >>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 >>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 >>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 >>>> >>>> cpumap NAPI patches v2 >>>> >>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 >>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 >>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 >>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 >>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 >>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 >>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47% >>>> >>>> Thanks, >>>> Daniel >>> >>> Hi Daniel, >>> >>> cool, thx for testing it. >>> >>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me >>> to send a regular patch for it? >> >> Hi, >> >> I had a small vacation, sorry. I'm starting working on it again today. > > ack, no worries. Are you going to rebase the other patches on top of it > or are you going to try a different approach? I'll try the approach without NAPI as Kuba asks and let Daniel test it, then we'll see. BTW I'm curious how he got this boost on v2, from what I see you didn't change the implementation that much? Thanks, Olek
From: Alexander Lobakin <aleksander.lobakin@intel.com> Date: Wed, 9 Oct 2024 14:50:42 +0200 > From: Lorenzo Bianconi <lorenzo@kernel.org> > Date: Wed, 9 Oct 2024 14:47:58 +0200 > >>> From: Lorenzo Bianconi <lorenzo@kernel.org> >>> Date: Wed, 9 Oct 2024 12:46:00 +0200 >>> >>>>> Hi Lorenzo, >>>>> >>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: >>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a >>>>>> NAPI-kthread pinned on the selected cpu. >>>>>> >>>>>> Changes in rfc v2: >>>>>> - get rid of dummy netdev dependency >>>>>> >>>>>> Lorenzo Bianconi (3): >>>>>> net: Add napi_init_for_gro routine >>>>>> net: add napi_threaded_poll to netdevice.h >>>>>> bpf: cpumap: Add gro support >>>>>> >>>>>> include/linux/netdevice.h | 3 + >>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- >>>>>> net/core/dev.c | 27 ++++++--- >>>>>> 3 files changed, 73 insertions(+), 80 deletions(-) >>>>>> >>>>>> -- >>>>>> 2.46.0 >>>>>> >>>>> >>>>> Sorry about the long delay - finally caught up to everything after >>>>> conferences. >>>>> >>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing >>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only >>>>> variable I changed is kernel version - steering prog is active for both. >>>>> >>>>> >>>>> Baseline (again) >>>>> >>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 >>>>> >>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 >>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 >>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 >>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 >>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 >>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 >>>>> >>>>> cpumap NAPI patches v2 >>>>> >>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 >>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 >>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 >>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 >>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 >>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 >>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47% >>>>> >>>>> Thanks, >>>>> Daniel >>>> >>>> Hi Daniel, >>>> >>>> cool, thx for testing it. >>>> >>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me >>>> to send a regular patch for it? >>> >>> Hi, >>> >>> I had a small vacation, sorry. I'm starting working on it again today. >> >> ack, no worries. Are you going to rebase the other patches on top of it >> or are you going to try a different approach? > > I'll try the approach without NAPI as Kuba asks and let Daniel test it, > then we'll see. For now, I have the same results without NAPI as with your series, so I'll push it soon and let Daniel test. (I simply decoupled GRO and NAPI and used the former in cpumap, but the kthread logic didn't change) > > BTW I'm curious how he got this boost on v2, from what I see you didn't > change the implementation that much? Thanks, Olek
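[For readers skimming the thread: a rough sketch of the decoupled approach
Olek describes above, i.e. GRO driven directly from the cpumap kthread with
no napi_schedule()/poll()/napi_complete(). It assumes gro_init()/
gro_receive_skb()/gro_flush() helpers that operate on a plain GRO state
struct, in the spirit of the branch referenced later in the thread; the
rcpu->gro member and the function below are illustrative assumptions, not
the posted patches.]

/* Illustrative only: batch processing in the cpumap kthread using GRO
 * directly. Assumes a hypothetical 'struct gro_node gro' member in
 * struct bpf_cpu_map_entry, initialized once with gro_init().
 */
static void cpu_map_gro_receive(struct bpf_cpu_map_entry *rcpu,
				struct sk_buff **skbs, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		gro_receive_skb(&rcpu->gro, skbs[i]);

	/* The kthread itself flushes the GRO lists once the batch is
	 * done; nothing is ever parked on a NAPI poll list.
	 */
	gro_flush(&rcpu->gro, false);
}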
From: Alexander Lobakin <aleksander.lobakin@intel.com> Date: Tue, 22 Oct 2024 17:51:43 +0200 > From: Alexander Lobakin <aleksander.lobakin@intel.com> > Date: Wed, 9 Oct 2024 14:50:42 +0200 > >> From: Lorenzo Bianconi <lorenzo@kernel.org> >> Date: Wed, 9 Oct 2024 14:47:58 +0200 >> >>>> From: Lorenzo Bianconi <lorenzo@kernel.org> >>>> Date: Wed, 9 Oct 2024 12:46:00 +0200 >>>> >>>>>> Hi Lorenzo, >>>>>> >>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: >>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a >>>>>>> NAPI-kthread pinned on the selected cpu. >>>>>>> >>>>>>> Changes in rfc v2: >>>>>>> - get rid of dummy netdev dependency >>>>>>> >>>>>>> Lorenzo Bianconi (3): >>>>>>> net: Add napi_init_for_gro routine >>>>>>> net: add napi_threaded_poll to netdevice.h >>>>>>> bpf: cpumap: Add gro support >>>>>>> >>>>>>> include/linux/netdevice.h | 3 + >>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- >>>>>>> net/core/dev.c | 27 ++++++--- >>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-) >>>>>>> >>>>>>> -- >>>>>>> 2.46.0 >>>>>>> >>>>>> >>>>>> Sorry about the long delay - finally caught up to everything after >>>>>> conferences. >>>>>> >>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing >>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only >>>>>> variable I changed is kernel version - steering prog is active for both. >>>>>> >>>>>> >>>>>> Baseline (again) >>>>>> >>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 >>>>>> >>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 >>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 >>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 >>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 >>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 >>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 >>>>>> >>>>>> cpumap NAPI patches v2 >>>>>> >>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 >>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 >>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 >>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 >>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 >>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 >>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47% >>>>>> >>>>>> Thanks, >>>>>> Daniel >>>>> >>>>> Hi Daniel, >>>>> >>>>> cool, thx for testing it. >>>>> >>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me >>>>> to send a regular patch for it? >>>> >>>> Hi, >>>> >>>> I had a small vacation, sorry. I'm starting working on it again today. >>> >>> ack, no worries. Are you going to rebase the other patches on top of it >>> or are you going to try a different approach? >> >> I'll try the approach without NAPI as Kuba asks and let Daniel test it, >> then we'll see. > > For now, I have the same results without NAPI as with your series, so > I'll push it soon and let Daniel test. 
> > (I simply decoupled GRO and NAPI and used the former in cpumap, but the > kthread logic didn't change) > >> >> BTW I'm curious how he got this boost on v2, from what I see you didn't >> change the implementation that much? Hi Daniel, Sorry for the delay. Please test [0]. [0] https://github.com/alobakin/linux/commits/cpumap-old Thanks, Olek
On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: > From: Alexander Lobakin <aleksander.lobakin@intel.com> > Date: Tue, 22 Oct 2024 17:51:43 +0200 > >> From: Alexander Lobakin <aleksander.lobakin@intel.com> >> Date: Wed, 9 Oct 2024 14:50:42 +0200 >> >>> From: Lorenzo Bianconi <lorenzo@kernel.org> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200 >>> >>>>> From: Lorenzo Bianconi <lorenzo@kernel.org> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200 >>>>> >>>>>>> Hi Lorenzo, >>>>>>> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote: >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a >>>>>>>> NAPI-kthread pinned on the selected cpu. >>>>>>>> >>>>>>>> Changes in rfc v2: >>>>>>>> - get rid of dummy netdev dependency >>>>>>>> >>>>>>>> Lorenzo Bianconi (3): >>>>>>>> net: Add napi_init_for_gro routine >>>>>>>> net: add napi_threaded_poll to netdevice.h >>>>>>>> bpf: cpumap: Add gro support >>>>>>>> >>>>>>>> include/linux/netdevice.h | 3 + >>>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++---------------------- >>>>>>>> net/core/dev.c | 27 ++++++--- >>>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-) >>>>>>>> >>>>>>>> -- >>>>>>>> 2.46.0 >>>>>>>> >>>>>>> >>>>>>> Sorry about the long delay - finally caught up to everything after >>>>>>> conferences. >>>>>>> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only >>>>>>> variable I changed is kernel version - steering prog is active for both. >>>>>>> >>>>>>> >>>>>>> Baseline (again) >>>>>>> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30 ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30 >>>>>>> >>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>>>> Run 1 2560252 0.00009087 0.00010495 0.00011647 Run 1 15479.31 >>>>>>> Run 2 2665517 0.00008575 0.00010239 0.00013311 Run 2 15162.48 >>>>>>> Run 3 2755939 0.00008191 0.00010367 0.00012287 Run 3 14709.04 >>>>>>> Run 4 2595680 0.00008575 0.00011263 0.00012671 Run 4 15373.06 >>>>>>> Run 5 2841865 0.00007999 0.00009471 0.00012799 Run 5 15234.91 >>>>>>> Average 2683850.6 0.000084854 0.00010367 0.00012543 Average 15191.76 >>>>>>> >>>>>>> cpumap NAPI patches v2 >>>>>>> >>>>>>> Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) >>>>>>> Run 1 2577838 0.00008575 0.00012031 0.00013695 Run 1 19914.56 >>>>>>> Run 2 2729237 0.00007551 0.00013311 0.00017663 Run 2 20140.92 >>>>>>> Run 3 2689442 0.00008319 0.00010495 0.00013311 Run 3 19887.48 >>>>>>> Run 4 2862366 0.00008127 0.00009471 0.00010623 Run 4 19374.49 >>>>>>> Run 5 2700538 0.00008319 0.00010367 0.00012799 Run 5 19784.49 >>>>>>> Average 2711884.2 0.000081782 0.00011135 0.000136182 Average 19820.388 >>>>>>> Delta 1.04% -3.62% 7.41% 8.57% 30.47% >>>>>>> >>>>>>> Thanks, >>>>>>> Daniel >>>>>> >>>>>> Hi Daniel, >>>>>> >>>>>> cool, thx for testing it. >>>>>> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me >>>>>> to send a regular patch for it? >>>>> >>>>> Hi, >>>>> >>>>> I had a small vacation, sorry. I'm starting working on it again today. >>>> >>>> ack, no worries. Are you going to rebase the other patches on top of it >>>> or are you going to try a different approach? >>> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it, >>> then we'll see. >> >> For now, I have the same results without NAPI as with your series, so >> I'll push it soon and let Daniel test. 
>> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the >> kthread logic didn't change) >> >>> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't >>> change the implementation that much? > > Hi Daniel, > > Sorry for the delay. Please test [0]. > > [0] https://github.com/alobakin/linux/commits/cpumap-old > > Thanks, > Olek Ack. Will do probably early next week.
Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@intel.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>>> Hi Lorenzo,
> >>>>>>>
> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> >>>>>>>> NAPI-kthread pinned on the selected cpu.
> >>>>>>>>
> >>>>>>>> Changes in rfc v2:
> >>>>>>>> - get rid of dummy netdev dependency
> >>>>>>>>
> >>>>>>>> Lorenzo Bianconi (3):
> >>>>>>>>   net: Add napi_init_for_gro routine
> >>>>>>>>   net: add napi_threaded_poll to netdevice.h
> >>>>>>>>   bpf: cpumap: Add gro support
> >>>>>>>>
> >>>>>>>>  include/linux/netdevice.h |   3 +
> >>>>>>>>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
> >>>>>>>>  net/core/dev.c            |  27 ++++++---
> >>>>>>>>  3 files changed, 73 insertions(+), 80 deletions(-)
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.46.0
> >>>>>>>>
> >>>>>>>
> >>>>>>> Sorry about the long delay - finally caught up to everything after
> >>>>>>> conferences.
> >>>>>>>
> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >>>>>>> variable I changed is kernel version - steering prog is active for both.
> >>>>>>>
> >>>>>>>
> >>>>>>> Baseline (again)
> >>>>>>>
> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30          ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>>>>>>
> >>>>>>>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
> >>>>>>> Run 1   2560252       0.00009087       0.00010495       0.00011647       Run 1   15479.31
> >>>>>>> Run 2   2665517       0.00008575       0.00010239       0.00013311       Run 2   15162.48
> >>>>>>> Run 3   2755939       0.00008191       0.00010367       0.00012287       Run 3   14709.04
> >>>>>>> Run 4   2595680       0.00008575       0.00011263       0.00012671       Run 4   15373.06
> >>>>>>> Run 5   2841865       0.00007999       0.00009471       0.00012799       Run 5   15234.91
> >>>>>>> Average 2683850.6     0.000084854      0.00010367       0.00012543       Average 15191.76
> >>>>>>>
> >>>>>>> cpumap NAPI patches v2
> >>>>>>>
> >>>>>>>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
> >>>>>>> Run 1   2577838       0.00008575       0.00012031       0.00013695       Run 1   19914.56
> >>>>>>> Run 2   2729237       0.00007551       0.00013311       0.00017663       Run 2   20140.92
> >>>>>>> Run 3   2689442       0.00008319       0.00010495       0.00013311       Run 3   19887.48
> >>>>>>> Run 4   2862366       0.00008127       0.00009471       0.00010623       Run 4   19374.49
> >>>>>>> Run 5   2700538       0.00008319       0.00010367       0.00012799       Run 5   19784.49
> >>>>>>> Average 2711884.2     0.000081782      0.00011135       0.000136182      Average 19820.388
> >>>>>>> Delta   1.04%         -3.62%           7.41%            8.57%                    30.47%
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Daniel
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting to work on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >> kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>

Baseline (again)

        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
Run 1   3169917       0.00007295       0.00007871       0.00009343       Run 1   21749.43
Run 2   3228290       0.00007103       0.00007679       0.00009215       Run 2   21897.17
Run 3   3226746       0.00007231       0.00007871       0.00009087       Run 3   21906.82
Run 4   3191258       0.00007231       0.00007743       0.00009087       Run 4   21155.15
Run 5   3235653       0.00007231       0.00007743       0.00008703       Run 5   21397.06
Average 3210372.8     0.000072182      0.000077814      0.00009087       Average 21621.126

cpumap v2 Olek

        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
Run 1   3253651       0.00007167       0.00007807       0.00009343       Run 1   13497.57
Run 2   3221492       0.00007231       0.00007743       0.00009087       Run 2   12115.53
Run 3   3296453       0.00007039       0.00007807       0.00009087       Run 3   12323.38
Run 4   3254460       0.00007167       0.00007807       0.00009087       Run 4   12901.88
Run 5   3173327       0.00007295       0.00007871       0.00009215       Run 5   12593.22
Average 3239876.6     0.000071798      0.00007807       0.000091638      Average 12686.316
Delta   0.92%         -0.53%           0.33%            0.85%                    -41.32%


It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's some
output from the profiles I took with:

perf record -e cycles:k -a -- sleep 10
perf --no-pager diff perf.data.baseline perf.data.withpatches > ...

# Event 'cycles:k'
# Baseline  Delta Abs  Shared Object                       Symbol
    6.13%     -3.60%  [kernel.kallsyms]                   [k] _copy_to_iter
    3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency   [k] bpf_prog_954ab9c8c8b5e42f_latency
              +2.22%  bpf_prog_5c74b34eb24d5c9b_steering  [k] bpf_prog_5c74b34eb24d5c9b_steering
    2.61%     -1.88%  [kernel.kallsyms]                   [k] __skb_datagram_iter
    0.55%     +1.53%  [kernel.kallsyms]                   [k] acpi_processor_ffh_cstate_enter
    4.52%     -1.46%  [kernel.kallsyms]                   [k] read_tsc
    0.34%     +1.42%  [kernel.kallsyms]                   [k] __slab_free
    0.97%     +1.18%  [kernel.kallsyms]                   [k] do_idle
    1.35%     +1.17%  [kernel.kallsyms]                   [k] cpuidle_enter_state
    1.89%     -1.15%  [kernel.kallsyms]                   [k] tcp_ack
    2.08%     +1.14%  [kernel.kallsyms]                   [k] _raw_spin_lock
              +1.13%  <redacted>
    0.22%     +1.02%  [kernel.kallsyms]                   [k] __sock_wfree
    2.23%     -1.02%  [kernel.kallsyms]                   [k] bpf_dynptr_slice
    0.00%     +0.98%  [kernel.kallsyms]                   [k] tcp6_gro_receive
    2.91%     -0.98%  [kernel.kallsyms]                   [k] csum_partial
    0.62%     +0.94%  [kernel.kallsyms]                   [k] skb_release_data
              +0.81%  [kernel.kallsyms]                   [k] memset
    0.16%     +0.74%  [kernel.kallsyms]                   [k] bnxt_tx_int
    0.00%     +0.74%  [kernel.kallsyms]                   [k] dev_gro_receive
    0.36%     +0.74%  [kernel.kallsyms]                   [k] __tcp_transmit_skb
              +0.72%  [kernel.kallsyms]                   [k] tcp_gro_receive
    1.10%     -0.66%  [kernel.kallsyms]                   [k] ep_poll_callback
    1.52%     -0.65%  [kernel.kallsyms]                   [k] page_pool_put_unrefed_netmem
    0.75%     -0.57%  [kernel.kallsyms]                   [k] bnxt_rx_pkt
    1.10%     +0.56%  [kernel.kallsyms]                   [k] native_sched_clock
    0.16%     +0.53%  <redacted>
    0.83%     -0.53%  [kernel.kallsyms]                   [k] skb_try_coalesce
    0.60%     +0.53%  [kernel.kallsyms]                   [k] eth_type_trans
    1.65%     -0.51%  [kernel.kallsyms]                   [k] _raw_spin_lock_irqsave
    0.14%     +0.50%  [kernel.kallsyms]                   [k] bnxt_start_xmit
    0.54%     -0.48%  [kernel.kallsyms]                   [k] __skb_frag_unref
    0.91%     +0.48%  [cls_bpf]                           [k] 0x0000000000000010
    0.00%     +0.47%  [kernel.kallsyms]                   [k] ipv6_gro_receive
    0.76%     -0.45%  [kernel.kallsyms]                   [k] tcp_rcv_established
    0.94%     -0.45%  [kernel.kallsyms]                   [k] __inet6_lookup_established
    0.31%     +0.43%  [kernel.kallsyms]                   [k] __sched_text_start
    0.21%     +0.43%  [kernel.kallsyms]                   [k] poll_idle
    0.91%     -0.42%  [kernel.kallsyms]                   [k] tcp_try_coalesce
    0.91%     -0.42%  [kernel.kallsyms]                   [k] kmem_cache_free
    1.13%     +0.42%  [kernel.kallsyms]                   [k] __bnxt_poll_work
    0.48%     -0.41%  [kernel.kallsyms]                   [k] tcp_urg
              +0.39%  [kernel.kallsyms]                   [k] memcpy
    0.51%     -0.38%  [kernel.kallsyms]                   [k] _raw_read_unlock_irqrestore
              +0.38%  [kernel.kallsyms]                   [k] __skb_gro_checksum_complete
              +0.37%  [kernel.kallsyms]                   [k] irq_entries_start
    0.16%     +0.36%  [kernel.kallsyms]                   [k] bpf_sk_storage_get
    0.62%     -0.36%  [kernel.kallsyms]                   [k] page_pool_refill_alloc_cache
    0.08%     +0.35%  [kernel.kallsyms]                   [k] ip6_finish_output2
    0.14%     +0.34%  [kernel.kallsyms]                   [k] bnxt_poll_p5
    0.06%     +0.33%  [sch_fq]                            [k] 0x0000000000000020
    0.04%     +0.32%  [kernel.kallsyms]                   [k] __dev_queue_xmit
    0.75%     -0.32%  [kernel.kallsyms]                   [k] __xdp_build_skb_from_frame
    0.67%     -0.31%  [kernel.kallsyms]                   [k] sock_def_readable
    0.05%     +0.31%  [kernel.kallsyms]                   [k] netif_skb_features
              +0.30%  [kernel.kallsyms]                   [k] tcp_gro_pull_header
    0.49%     -0.29%  [kernel.kallsyms]                   [k] napi_pp_put_page
    0.18%     +0.29%  [kernel.kallsyms]                   [k] call_function_single_prep_ipi
    0.40%     -0.28%  [kernel.kallsyms]                   [k] _raw_read_lock_irqsave
    0.11%     +0.27%  [kernel.kallsyms]                   [k] raw6_local_deliver
    0.18%     +0.26%  [kernel.kallsyms]                   [k] ip6_dst_check
    0.42%     -0.26%  [kernel.kallsyms]                   [k] netif_receive_skb_list_internal
    0.05%     +0.26%  [kernel.kallsyms]                   [k] __qdisc_run
    0.75%     +0.25%  [kernel.kallsyms]                   [k] __build_skb_around
    0.05%     +0.25%  [kernel.kallsyms]                   [k] htab_map_hash
    0.09%     +0.24%  [kernel.kallsyms]                   [k] net_rx_action
    0.07%     +0.23%  <redacted>
    0.45%     -0.23%  [kernel.kallsyms]                   [k] migrate_enable
    0.48%     -0.23%  [kernel.kallsyms]                   [k] mem_cgroup_charge_skmem
    0.26%     +0.23%  [kernel.kallsyms]                   [k] __switch_to
    0.15%     +0.22%  [kernel.kallsyms]                   [k] sock_rfree
    0.30%     -0.22%  [kernel.kallsyms]                   [k] tcp_add_backlog
<snip>
    5.68%             bpf_prog_17fea1bb6503ed98_steering  [k] bpf_prog_17fea1bb6503ed98_steering
    2.10%             [kernel.kallsyms]                   [k] __skb_checksum_complete
    0.71%             [kernel.kallsyms]                   [k] __memset
    0.54%             [kernel.kallsyms]                   [k] __memcpy
    0.18%             [kernel.kallsyms]                   [k] __irqentry_text_start
<snip>

Please let me know if you want me to collect any other data.

Thanks,
Daniel
From: Daniel Xu <dxu@dxuuu.xyz> Date: Fri, 22 Nov 2024 17:10:06 -0700 > Hi Olek, > > Here are the results. > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote: >> >> >> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: [...] > Baseline (again) > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43 > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17 > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82 > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15 > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06 > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126 > > cpumap v2 Olek > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57 > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53 > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38 > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88 > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22 > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316 > Delta 0.92% -0.53% 0.33% 0.85% -41.32% > > > It's very interesting that we see -40% tput w/ the patches. I went back Oh no, I messed up something =\ Could you please also test not the whole series, but patches 1-3 (up to "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb array...")? Would be great to see whether this implementation works worse right from the start or I just broke something later on. > and double checked and it seems the numbers are right. Here's the > some output from some profiles I took with: > > perf record -e cycles:k -a -- sleep 10 > perf --no-pager diff perf.data.baseline perf.data.withpatches > ... > > # Event 'cycles:k' > # Baseline Delta Abs Shared Object Symbol > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter BTW, what CONFIG_HZ do you have on the kernel you're testing with? Thanks, Olek
On Mon, Nov 25, 2024 at 04:12:24PM GMT, Alexander Lobakin wrote: > From: Daniel Xu <dxu@dxuuu.xyz> > Date: Fri, 22 Nov 2024 17:10:06 -0700 > > > Hi Olek, > > > > Here are the results. > > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote: > >> > >> > >> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: > > [...] > > > Baseline (again) > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43 > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17 > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82 > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15 > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06 > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126 > > > > cpumap v2 Olek > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57 > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53 > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38 > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88 > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22 > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316 > > Delta 0.92% -0.53% 0.33% 0.85% -41.32% > > > > > > It's very interesting that we see -40% tput w/ the patches. I went back > > Oh no, I messed up something =\ > > Could you please also test not the whole series, but patches 1-3 (up to > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb > array...")? Would be great to see whether this implementation works > worse right from the start or I just broke something later on. Will do. > > > and double checked and it seems the numbers are right. Here's the > > some output from some profiles I took with: > > > > perf record -e cycles:k -a -- sleep 10 > > perf --no-pager diff perf.data.baseline perf.data.withpatches > ... > > > > # Event 'cycles:k' > > # Baseline Delta Abs Shared Object Symbol > > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter > > BTW, what CONFIG_HZ do you have on the kernel you're testing with? # zgrep CONFIG_HZ /proc/config.gz # CONFIG_HZ_PERIODIC is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 Just curious - why do you ask? Thanks, Daniel
On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
> > Hi Olek,
> >
> > Here are the results.
> >
> > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > >
> > >
> > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
> > Baseline (again)
> >
> >         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
> > Run 1   3169917       0.00007295       0.00007871       0.00009343       Run 1   21749.43
> > Run 2   3228290       0.00007103       0.00007679       0.00009215       Run 2   21897.17
> > Run 3   3226746       0.00007231       0.00007871       0.00009087       Run 3   21906.82
> > Run 4   3191258       0.00007231       0.00007743       0.00009087       Run 4   21155.15
> > Run 5   3235653       0.00007231       0.00007743       0.00008703       Run 5   21397.06
> > Average 3210372.8     0.000072182      0.000077814      0.00009087       Average 21621.126
> >

We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results. Especially controlling
which CPU cores our code paths are executing on.

In the above "baseline" case, we have two processes/tasks executing:
 (1) RX-napi softirq/thread (until napi_gro_receive delivers to socket)
 (2) Userspace netserver process TCP receiving data from socket.

My experience is that you will see two noticeably different throughput
results depending on whether (1) and (2) are executing on the *same*
CPU (multi-tasking context-switching), or executing in parallel
(e.g. pinned) on two different CPU cores.

The netperf command has an option

 -T lcpu,remcpu
    Request that netperf be bound to local CPU lcpu and/or netserver be
    bound to remote CPU rcpu.

Verify the setting by listing the pinning like this:

 for PID in $(pidof netserver); do taskset -pc $PID ; done

You can also set the pinning at runtime like this:

 export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done

For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust the pinning at runtime to observe the effect quickly.

My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as they depend on task
timing, which can get affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:

 sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm,
as it requires holding the file open. E.g. I play with these profiles:

 sudo tuned-adm profile throughput-performance
 sudo tuned-adm profile latency-performance
 sudo tuned-adm profile network-latency

> > cpumap v2 Olek
> >
> >         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)          Throughput (Mbit/s)
> > Run 1   3253651       0.00007167       0.00007807       0.00009343       Run 1   13497.57
> > Run 2   3221492       0.00007231       0.00007743       0.00009087       Run 2   12115.53
> > Run 3   3296453       0.00007039       0.00007807       0.00009087       Run 3   12323.38
> > Run 4   3254460       0.00007167       0.00007807       0.00009087       Run 4   12901.88
> > Run 5   3173327       0.00007295       0.00007871       0.00009215       Run 5   12593.22
> > Average 3239876.6     0.000071798      0.00007807       0.000091638      Average 12686.316
> > Delta   0.92%         -0.53%           0.33%            0.85%                    -41.32%
> >
> >

We now have three processes/tasks executing:
 (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
 (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
 (3) Userspace netserver process TCP receiving data from socket.

Again, the performance is going to depend on which CPU cores the
processes/tasks are running on and whether some are sharing the same
CPU. (There are both wakeup timing and cache-line effects).

There are now more combinations to test...

CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.

> > It's very interesting that we see -40% tput w/ the patches. I went back
>

Sad that we see -40% throughput... but do we know what CPU cores the
now three different tasks/processes run on(?)

> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
>
> > and double checked and it seems the numbers are right. Here's some
> > output from the profiles I took with:
> >
> > perf record -e cycles:k -a -- sleep 10
> > perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> >
> > # Event 'cycles:k'
> > # Baseline  Delta Abs  Shared Object  Symbol
> > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter
>

I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.

Fortunately, perf diff (and perf report) support doing this:

 perf diff --sort=cpu,symbol

But then you also need to control the CPUs used in the experiment for
the diff to work.

I hope I made sense, as these kinds of CPU scaling benchmarks are tricky,
--Jesper
Hi Jesper, On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote: > > > On 25/11/2024 16.12, Alexander Lobakin wrote: > > From: Daniel Xu <dxu@dxuuu.xyz> > > Date: Fri, 22 Nov 2024 17:10:06 -0700 > > > > > Hi Olek, > > > > > > Here are the results. > > > > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote: > > > > > > > > > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: > > > > [...] > > > > > Baseline (again) > > > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43 > > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17 > > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82 > > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15 > > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06 > > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126 > > > > > We need to talk about what we are measuring, and how to control the > experiment setup to get reproducible results. > Especially controlling on what CPU cores our code paths are executing. > > In above "baseline" case, we have two processes/tasks executing: > (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket) > (2) Userspace netserver process TCP receiving data from socket. "baseline" in this case is still cpumap, just without these GRO patches. > > My experience is that you will see two noticeable different > throughput performance results depending on whether (1) and (2) is > executing on the *same* CPU (multi-tasking context-switching), > or executing in parallel (e.g. pinned) on two different CPU cores. > > The netperf command have an option > > -T lcpu,remcpu > Request that netperf be bound to local CPU lcpu and/or netserver be > bound to remote CPU rcpu. > > Verify setting by listing pinning like this: > for PID in $(pidof netserver); do taskset -pc $PID ; done > > You can also set pinning runtime like this: > export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; > done > > For troubleshooting, I like to use the periodic 1 sec (netperf -D1) > output and adjust pinning runtime to observe the effect quickly. > > My experience is unfortunately that TCP results have a lot of variation > (thanks for incliding 5 runs in your benchmarks), as it depends on tasks > timing, that can get affected by CPU sleep states. The systems CPU > latency setting can be seen in /dev/cpu_dma_latency, which can be read > like this: > > sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency > > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm > as it requires holding the file open. E.g I play with these profiles: > > sudo tuned-adm profile throughput-performance > sudo tuned-adm profile latency-performance > sudo tuned-adm profile network-latency Appreciate the tips - I should keep this saved somewhere. 
> > > > > cpumap v2 Olek > > > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57 > > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53 > > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38 > > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88 > > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22 > > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316 > > > Delta 0.92% -0.53% 0.33% 0.85% -41.32% > > > > > > > > > We now three processes/tasks executing: > (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap) > (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket) > (3) Userspace netserver process TCP receiving data from socket. > > Again, now the performance is going to depend on depending on which CPU > cores the processes/tasks are running and whether some are sharing the > same CPU. (There are both wakeup timing and cache-line effects). > > There are now more combinations to test... > > CPUmap is a CPU scaling facility, and you will likely also see different > CPU utilization on the difference cores one you start to pin these to > control the scenarios. > > > > It's very interesting that we see -40% tput w/ the patches. I went back > > > > Sad that we see -40% throughput... but do we know what CPU cores the > now three different tasks/processes run on(?) > Roughly, yes. For context, my primary use case for cpumap is to provide some degree of isolation between colocated containers on a single host. In particular, colocation occurs on AMD Bergamo. And containers are CPU pinned to their own CCX (roughly). My RX steering program ensures RX packets destined to a specific container are cpumap redirected to any of the container's pinned CPUs. It not only provides a good measure of isolation but ensures resources are properly accounted. So to answer your question of which CPUs the 3 things run on: cpumap kthread and application run on the same set of cores. More than that, they share the same L3 cache by design. irq/softirq is effectively random given default RSS config and IRQ affinities. > > > Oh no, I messed up something =\ > > > Could you please also test not the whole series, but patches 1-3 (up to > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb > > array...")? Would be great to see whether this implementation works > > worse right from the start or I just broke something later on. > > > > > and double checked and it seems the numbers are right. Here's the > > > some output from some profiles I took with: > > > > > > perf record -e cycles:k -a -- sleep 10 > > > perf --no-pager diff perf.data.baseline perf.data.withpatches > ... > > > > > > # Event 'cycles:k' > > > # Baseline Delta Abs Shared Object Symbol > > > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter > > > > I really appreciate that you provide perf data and perf diff, but as > described above, we need data and information on what CPU cores are > running which workload. > > Fortunately perf diff (and perf report) support doing like this: > perf diff --sort=cpu,symbol > > But then you also need to control the CPUs used in experiment for the > diff to work. > > I hope I made sense as these kind of CPU scaling benchmarks are tricky, Indeed, sounds quite tricky. My understanding with GRO is that it's a powerful general purpose optimization. 
Enough that it should rise above the usual noise on a reasonably
configured system (which mine is).

Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations?

So in Olek's above series, patches 1-3 seem like they would still
benefit from a simpler testbed. But the more targeted optimizations in
patch 4+ would probably justify a de-noised setup. Possibly a single
host with xdp-trafficgen or something.

Procedurally speaking, maybe it would save some wasted effort if
everyone agreed on the general approach before investing more time into
finer optimizations built on top of the basic GRO support?

Thanks,
Daniel
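[For context on the setup Daniel describes above: a minimal sketch of such
an RX steering program, i.e. an XDP program that picks one of the
destination container's pinned CPUs and redirects the packet there through
a BPF_MAP_TYPE_CPUMAP. The map sizing and pick_target_cpu() are
illustrative assumptions, not Daniel's actual program.]

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);	/* one slot per possible CPU */
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

static __always_inline __u32 pick_target_cpu(struct xdp_md *ctx)
{
	/* Placeholder: a real program would parse the headers, look up
	 * the destination container and hash across its pinned CPUs
	 * (e.g. one CCX on Bergamo).
	 */
	return ctx->rx_queue_index % 8;
}

SEC("xdp")
int steering(struct xdp_md *ctx)
{
	/* The third argument is the fallback action taken if the
	 * selected cpumap slot is empty.
	 */
	return bpf_redirect_map(&cpu_map, pick_target_cpu(ctx), XDP_PASS);
}

char _license[] SEC("license") = "GPL";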
> Hi Jesper, > > On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote: > > > > > > On 25/11/2024 16.12, Alexander Lobakin wrote: > > > From: Daniel Xu <dxu@dxuuu.xyz> > > > Date: Fri, 22 Nov 2024 17:10:06 -0700 > > > > > > > Hi Olek, > > > > > > > > Here are the results. > > > > > > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote: > > > > > > > > > > > > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: > > > > > > [...] > > > > > > > Baseline (again) > > > > > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749.43 > > > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897.17 > > > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906.82 > > > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155.15 > > > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397.06 > > > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621.126 > > > > > > > > We need to talk about what we are measuring, and how to control the > > experiment setup to get reproducible results. > > Especially controlling on what CPU cores our code paths are executing. > > > > In above "baseline" case, we have two processes/tasks executing: > > (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket) > > (2) Userspace netserver process TCP receiving data from socket. > > "baseline" in this case is still cpumap, just without these GRO patches. > > > > > My experience is that you will see two noticeable different > > throughput performance results depending on whether (1) and (2) is > > executing on the *same* CPU (multi-tasking context-switching), > > or executing in parallel (e.g. pinned) on two different CPU cores. > > > > The netperf command have an option > > > > -T lcpu,remcpu > > Request that netperf be bound to local CPU lcpu and/or netserver be > > bound to remote CPU rcpu. > > > > Verify setting by listing pinning like this: > > for PID in $(pidof netserver); do taskset -pc $PID ; done > > > > You can also set pinning runtime like this: > > export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; > > done > > > > For troubleshooting, I like to use the periodic 1 sec (netperf -D1) > > output and adjust pinning runtime to observe the effect quickly. > > > > My experience is unfortunately that TCP results have a lot of variation > > (thanks for incliding 5 runs in your benchmarks), as it depends on tasks > > timing, that can get affected by CPU sleep states. The systems CPU > > latency setting can be seen in /dev/cpu_dma_latency, which can be read > > like this: > > > > sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency > > > > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm > > as it requires holding the file open. E.g I play with these profiles: > > > > sudo tuned-adm profile throughput-performance > > sudo tuned-adm profile latency-performance > > sudo tuned-adm profile network-latency > > Appreciate the tips - I should keep this saved somewhere. 
> > > > > > > > > cpumap v2 Olek > > > > > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Throughput (Mbit/s) > > > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497.57 > > > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115.53 > > > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323.38 > > > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901.88 > > > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593.22 > > > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686.316 > > > > Delta 0.92% -0.53% 0.33% 0.85% -41.32% > > > > > > > > > > > > > > We now three processes/tasks executing: > > (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap) > > (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket) > > (3) Userspace netserver process TCP receiving data from socket. > > > > Again, now the performance is going to depend on depending on which CPU > > cores the processes/tasks are running and whether some are sharing the > > same CPU. (There are both wakeup timing and cache-line effects). > > > > There are now more combinations to test... > > > > CPUmap is a CPU scaling facility, and you will likely also see different > > CPU utilization on the difference cores one you start to pin these to > > control the scenarios. > > > > > > It's very interesting that we see -40% tput w/ the patches. I went back > > > > > > > Sad that we see -40% throughput... but do we know what CPU cores the > > now three different tasks/processes run on(?) > > > > Roughly, yes. For context, my primary use case for cpumap is to provide > some degree of isolation between colocated containers on a single host. > In particular, colocation occurs on AMD Bergamo. And containers are > CPU pinned to their own CCX (roughly). My RX steering program ensures > RX packets destined to a specific container are cpumap redirected to any > of the container's pinned CPUs. It not only provides a good measure of > isolation but ensures resources are properly accounted. > > So to answer your question of which CPUs the 3 things run on: cpumap > kthread and application run on the same set of cores. More than that, > they share the same L3 cache by design. irq/softirq is effectively > random given default RSS config and IRQ affinities. > > > > > > > Oh no, I messed up something =\ > > > > Could you please also test not the whole series, but patches 1-3 (up to > > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb > > > array...")? Would be great to see whether this implementation works > > > worse right from the start or I just broke something later on. > > > > > > > and double checked and it seems the numbers are right. Here's the > > > > some output from some profiles I took with: > > > > > > > > perf record -e cycles:k -a -- sleep 10 > > > > perf --no-pager diff perf.data.baseline perf.data.withpatches > ... > > > > > > > > # Event 'cycles:k' > > > > # Baseline Delta Abs Shared Object Symbol > > > > 6.13% -3.60% [kernel.kallsyms] [k] _copy_to_iter > > > > > > > I really appreciate that you provide perf data and perf diff, but as > > described above, we need data and information on what CPU cores are > > running which workload. > > > > Fortunately perf diff (and perf report) support doing like this: > > perf diff --sort=cpu,symbol > > > > But then you also need to control the CPUs used in experiment for the > > diff to work. 
> > I hope I made sense, as these kinds of CPU scaling benchmarks are tricky,
>
> Indeed, sounds quite tricky.
>
> My understanding with GRO is that it's a powerful general-purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
>
> Maybe we can consider decoupling the cpumap GRO enablement from the
> later optimizations?

I agree. First, we need to identify the best approach to enable GRO on
cpumap (between Olek's approach and what I have suggested) and then we
can evaluate subsequent optimizations.
@Olek: do you agree?

Regards,
Lorenzo

> So in Olek's above series, patches 1-3 seem like they would still
> benefit from a simpler testbed. But the more targeted optimizations in
> patch 4+ would probably justify a de-noised setup. Possibly a single
> host with xdp-trafficgen or something.
>
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
>
> Thanks,
> Daniel
>
On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Baseline (again)
>>
>> [...]
>>
>> cpumap v2 Olek
>>
>> [...]
>>
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Patches 1-3 reproduce the -40% tput numbers.

With patches 1-4 the numbers get slightly worse (~1 Gbps lower), but the
runs were noisy.

tcp_rr results were unaffected.

Thanks,
Daniel
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Mon, 25 Nov 2024 16:56:49 -0600

> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>> [...]
>>
>> Could you please also test not the whole series, but patches 1-3 (up to
>> "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>> array...")? Would be great to see whether this implementation works
>> worse right from the start or I just broke something later on.
>
> Patches 1-3 reproduce the -40% tput numbers.

Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
cpumap's kthreads instead of NAPI) really performs worse than switching
cpumap to NAPI.

> With patches 1-4 the numbers get slightly worse (~1 Gbps lower), but the
> runs were noisy.

Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.

> tcp_rr results were unaffected.

@ Jakub,

Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
least for now =\ I took a look at the backlog NAPI and it could be used,
although we'd need a pointer in the backlog to the corresponding cpumap
+ also some synchronization point to make sure the backlog NAPI won't
access an already destroyed cpumap.

Maybe Lorenzo could take a look...

Thanks,
Olek
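To make the synchronization point concrete, here is a rough sketch of
the lifetime concern Olek describes (hypothetical code, not from any
posted patch; the cpumap_entry field and cpu_map_entry_detach() are
made-up names):

	/* per-CPU backlog would hold an RCU pointer back to the entry */
	struct softnet_data {
		/* ... existing fields ... */
		struct bpf_cpu_map_entry __rcu *cpumap_entry;
	};

	static void cpu_map_entry_detach(struct bpf_cpu_map_entry *rcpu)
	{
		struct softnet_data *sd = &per_cpu(softnet_data, rcpu->cpu);

		rcu_assign_pointer(sd->cpumap_entry, NULL);
		synchronize_rcu();	/* in-flight backlog polls drain out */
		/* now safe to free rcpu */
	}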
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Mon, 25 Nov 2024 16:56:49 -0600
>
> [...]
>
> @ Jakub,
>
> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
> least for now =\ I took a look at the backlog NAPI and it could be used,
> although we'd need a pointer in the backlog to the corresponding cpumap
> + also some synchronization point to make sure the backlog NAPI won't
> access an already destroyed cpumap.
>
> Maybe Lorenzo could take a look...

It seems to me the only difference would be that we would use the shared
backlog_napi kthreads instead of having a dedicated kthread for each
cpumap entry, but we would still need the NAPI poll logic. I can look
into it if you prefer the shared kthread approach.

@Jakub: what do you think?

Regards,
Lorenzo

> Thanks,
> Olek
>
On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>
>> [...]
>>
>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>> least for now =\ I took a look at the backlog NAPI and it could be used,
>> although we'd need a pointer in the backlog to the corresponding cpumap
>> + also some synchronization point to make sure the backlog NAPI won't
>> access an already destroyed cpumap.
>>
>> Maybe Lorenzo could take a look...
>
> It seems to me the only difference would be that we would use the shared
> backlog_napi kthreads instead of having a dedicated kthread for each
> cpumap entry, but we would still need the NAPI poll logic. I can look
> into it if you prefer the shared kthread approach.

I don't like a shared kthread approach. For my use-case I want to give
the "remote" CPU-map kthreads higher scheduling priority. (As it will be
running a 2nd XDP BPF DDoS program protecting against overload by
dropping packets).

Thus, I'm not a fan of using the shared backlog_napi, as I don't want to
give backlog NAPI high priority in my use-case.

> @Jakub: what do you think?

--Jesper
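For reference, boosting a dedicated cpumap kthread the way Jesper
describes is a one-liner (a sketch; the "cpumap/<cpu>/map:<id>" thread
name comes from kernel/bpf/cpumap.c, and CPU 23 is just an example):

   # give the cpumap kthread for CPU 23 SCHED_FIFO priority 10
   sudo chrt -f -p 10 $(pgrep -f 'cpumap/23')

With the shared backlog_napi threads there is no per-map thread to
single out, which is the crux of the objection.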
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 26 Nov 2024 18:12:27 +0100

> On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>> [...]
>>
>> It seems to me the only difference would be that we would use the shared
>> backlog_napi kthreads instead of having a dedicated kthread for each
>> cpumap entry, but we would still need the NAPI poll logic. I can look
>> into it if you prefer the shared kthread approach.
>
> I don't like a shared kthread approach. For my use-case I want to give
> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
> running a 2nd XDP BPF DDoS program protecting against overload by
> dropping packets).

Oh, that is also valid.
Let's see what Jakub replies; for now I'm leaning towards posting the
approach from this RFC with my bulk allocation from the NAPI cache.

> Thus, I'm not a fan of using the shared backlog_napi, as I don't want to
> give backlog NAPI high priority in my use-case.
>
>> @Jakub: what do you think?
>
> --Jesper

Thanks,
Olek
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Date: Tue, 26 Nov 2024 18:12:27 +0100
>
> [...]
>
>> I don't like a shared kthread approach. For my use-case I want to give
>> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
>> running a 2nd XDP BPF DDoS program protecting against overload by
>> dropping packets).
>
> Oh, that is also valid.
> Let's see what Jakub replies; for now I'm leaning towards posting the
> approach from this RFC with my bulk allocation from the NAPI cache.

I guess it would be better to keep them separate, to check the effects
of each change (GRO for cpumap and bulk allocation). I guess you can
post your changes on top of mine if we all agree the proposed approach
is fine. What do you think?

Regards,
Lorenzo

>> Thus, I'm not a fan of using the shared backlog_napi, as I don't want to
>> give backlog NAPI high priority in my use-case.
>>
>>> @Jakub: what do you think?
>>
>> --Jesper
>
> Thanks,
> Olek
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Thu, 28 Nov 2024 11:56:24 +0100

>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>> Date: Tue, 26 Nov 2024 18:12:27 +0100
>>
>> [...]
>>
>> Oh, that is also valid.
>> Let's see what Jakub replies; for now I'm leaning towards posting the
>> approach from this RFC with my bulk allocation from the NAPI cache.
>
> I guess it would be better to keep them separate, to check the effects
> of each change (GRO for cpumap and bulk allocation). I guess you can
> post your changes on top of mine if we all agree the proposed approach
> is fine. What do you think?

Sounds good as well, I don't have any preference here.

> Regards,
> Lorenzo

Thanks,
Olek
On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote: > > tcp_rr results were unaffected. > > @ Jakub, Context? What doesn't work and why?
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 2 Dec 2024 14:47:39 -0800

> On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote:
>>> tcp_rr results were unaffected.
>>
>> @ Jakub,
>
> Context? What doesn't work and why?

My tests show the same perf as on Lorenzo's series, but I test with a UDP
trafficgen. Daniel tests TCP and the results are much worse than with
Lorenzo's implementation.
I suspect this is related to how NAPI performs flushes / decides whether
to repoll again or exit vs how the kthread does that (even though I also
try to flush only every 64 frames or when the ring is empty). Or maybe
to the fact that part of the kthread work happens in process context
outside any softirq, while when using NAPI, the whole loop runs inside
the RX softirq.

Jesper said that he'd like to see cpumap still using its own kthread, so
that its priority can be boosted separately from the backlog. That's why
we asked you whether it would be fine to have cpumap as threaded NAPI in
regards to all this :D

Thanks,
Olek
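A sketch of the flush policy Olek describes (hypothetical code, loosely
modeled on kernel/bpf/cpumap.c's kthread loop; gro_receive() and
gro_flush_list() stand in for whatever the actual patch calls):

	/* hypothetical kthread-side loop: GRO state is flushed every 64
	 * frames or once the ptr_ring runs dry -- unlike NAPI, which ties
	 * the flush to napi_complete_done() inside the RX softirq */
	#define CPUMAP_GRO_FLUSH_THRESH	64

	static void cpu_map_gro_loop(struct bpf_cpu_map_entry *rcpu)
	{
		unsigned int pending = 0;
		void *frame;

		while ((frame = __ptr_ring_consume(rcpu->queue))) {
			gro_receive(rcpu, frame);	/* hypothetical */
			if (++pending == CPUMAP_GRO_FLUSH_THRESH) {
				gro_flush_list(rcpu);	/* partial flush */
				pending = 0;
			}
		}
		if (pending)
			gro_flush_list(rcpu);		/* ring empty */
	}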
On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>> @ Jakub,
>>
>> Context? What doesn't work and why?
>
> My tests show the same perf as on Lorenzo's series, but I test with a UDP
> trafficgen. Daniel tests TCP and the results are much worse than with
> Lorenzo's implementation.
> I suspect this is related to how NAPI performs flushes / decides whether
> to repoll again or exit vs how the kthread does that (even though I also
> try to flush only every 64 frames or when the ring is empty). Or maybe
> to the fact that part of the kthread work happens in process context
> outside any softirq, while when using NAPI, the whole loop runs inside
> the RX softirq.
>
> Jesper said that he'd like to see cpumap still using its own kthread, so
> that its priority can be boosted separately from the backlog. That's why
> we asked you whether it would be fine to have cpumap as threaded NAPI in
> regards to all this :D

Certainly not without a clear understanding of what the problem with
a kthread is.
From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Dec 2024 16:51:57 -0800

> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>> [...]
>>
>> Jesper said that he'd like to see cpumap still using its own kthread, so
>> that its priority can be boosted separately from the backlog. That's why
>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>> regards to all this :D
>
> Certainly not without a clear understanding of what the problem with
> a kthread is.

Yes, sure thing.

The bad thing is that I can't reproduce Daniel's problem >_< Previously,
I was testing with the UDP trafficgen and got up to 80% improvement over
the baseline. Now I tested TCP and got up to 70% improvement, no
regressions whatsoever =\

I don't know where this regression on Daniel's setup comes from. Is it a
multi-thread or single-thread test? What app do you use: iperf, netperf,
neper, Microsoft's app (forgot the name)? Do you have multiple NUMA
nodes on your system, and are you sure you didn't cross the node when
redirecting with the GRO patches / no other NUMA mismatches happened?
Some other random stuff like the RSS hash key, which affects flow
steering?

Thanks,
Olek
On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
> [...]
>
> The bad thing is that I can't reproduce Daniel's problem >_< Previously,
> I was testing with the UDP trafficgen and got up to 80% improvement over
> the baseline. Now I tested TCP and got up to 70% improvement, no
> regressions whatsoever =\
>
> I don't know where this regression on Daniel's setup comes from. Is it a
> multi-thread or single-thread test?

8 threads with 16 flows over them (-T8 -F16)

> What app do you use: iperf, netperf, neper, Microsoft's app (forgot the
> name)?

neper, tcp_stream.

> Do you have multiple NUMA nodes on your system, and are you sure you
> didn't cross the node when redirecting with the GRO patches / no other
> NUMA mismatches happened?

Single node. Technically EPYC NPS=1. So there are some NUMA
characteristics, but I think the interconnect is supposed to hide them
fairly efficiently.

> Some other random stuff like the RSS hash key, which affects flow
> steering?

Whatever the default is - I'd be willing to bet Kuba set up the
configuration at one point or another, so it's probably sane. And with 5
runs it seems unlikely the hashing would get unlucky and cause an
imbalance.

Since I've got the setup handy and am motivated to see this work land,
do you have any other pointers for things I should look for? I'll spend
some time looking at profiles to see if I can identify any hot spots
compared to softirq based GRO.

Thanks,
Daniel
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Wed, 04 Dec 2024 13:51:08 -0800

> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>> [...]
>>
>> I don't know where this regression on Daniel's setup comes from. Is it a
>> multi-thread or single-thread test?
>
> 8 threads with 16 flows over them (-T8 -F16)
>
>> What app do you use: iperf, netperf, neper, Microsoft's app (forgot the
>> name)?
>
> neper, tcp_stream.

Let me recheck with neper -T8 -F16, I'll post my results soon.

> [...]
>
> Since I've got the setup handy and am motivated to see this work land,
> do you have any other pointers for things I should look for? I'll spend
> some time looking at profiles to see if I can identify any hot spots
> compared to softirq based GRO.

Thanks for helping with this!
Olek
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 5 Dec 2024 11:38:11 +0100

> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Wed, 04 Dec 2024 13:51:08 -0800
>
>> [...]
>>
>>> What app do you use: iperf, netperf, neper, Microsoft's app (forgot the
>>> name)?
>>
>> neper, tcp_stream.
>
> Let me recheck with neper -T8 -F16, I'll post my results soon.

kernel   direct T1   direct T8F16   cpumap   cpumap T8F16
clean    28          51             13       9              Gbps
GRO      28          51             26       18             Gbps

100% gain, no regressions =\

My XDP prog is simple (upstream xdp-tools repo with no changes):

numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
no-touch ens802f0np0

IOW it simply redirects everything to CPU 23 (same NUMA node) from any
Rx queue without looking into the headers or the packet.
Do you test with a more sophisticated XDP prog?

Thanks,
Olek
On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 5 Dec 2024 11:38:11 +0100
>
> [...]
>
> kernel   direct T1   direct T8F16   cpumap   cpumap T8F16
> clean    28          51             13       9              Gbps
> GRO      28          51             26       18             Gbps
>
> 100% gain, no regressions =\
>
> My XDP prog is simple (upstream xdp-tools repo with no changes):
>
> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
> no-touch ens802f0np0
>
> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
> Rx queue without looking into the headers or the packet.
> Do you test with a more sophisticated XDP prog?

Great reminder... my prog is a bit more sophisticated. I forgot we were
doing latency tracking by inserting a timestamp into the frame metadata,
but not clearing it after it was read on the remote CPU, which disables
GRO. So the previous test was paying the fixed GRO overhead without
getting any packet merges.

Once I fixed up the prog to reset the metadata pointer, I could see the
wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
No latency changes.

Sorry about the churn.

Daniel
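For anyone hitting the same trap: GRO refuses to merge packets whose skb
metadata differs, so per-packet values like timestamps left in
xdp->data_meta defeat it. A minimal sketch of the kind of fix Daniel
describes (hypothetical program, not his actual code; record_latency()
is a made-up stand-in for the real accounting):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	/* hypothetical: store the cpumap hop latency somewhere
	 * (map update, trace pipe, ...); body omitted */
	static __always_inline void record_latency(__u64 delta_ns) { }

	SEC("xdp/cpumap")
	int cpumap_latency_prog(struct xdp_md *ctx)
	{
		void *data = (void *)(long)ctx->data;
		void *meta = (void *)(long)ctx->data_meta;
		__u64 *tstamp = meta;

		/* the steering prog on the RX CPU stored a timestamp
		 * in data_meta before redirecting here */
		if ((void *)(tstamp + 1) <= data)
			record_latency(bpf_ktime_get_ns() - *tstamp);

		/* shrink the metadata area back to zero so all skbs
		 * carry identical (empty) metadata and GRO can merge */
		bpf_xdp_adjust_meta(ctx, (int)(data - meta));

		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";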
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 5 Dec 2024 17:41:27 -0700

> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>> [...]
>>
>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>> Rx queue without looking into the headers or the packet.
>> Do you test with a more sophisticated XDP prog?
>
> Great reminder... my prog is a bit more sophisticated. I forgot we were
> doing latency tracking by inserting a timestamp into the frame metadata,
> but not clearing it after it was read on the remote CPU, which disables
> GRO. So the previous test was paying the fixed GRO overhead without
> getting any packet merges.
>
> Once I fixed up the prog to reset the metadata pointer, I could see the
> wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
> No latency changes.
>
> Sorry about the churn.

No problem, crap happens sometimes :)

Let me send my implementation on Monday-Wednesday. I'll include my UDP
and TCP test results, as well as yours (+18%).

BTW would be great if you could give me a Tested-by tag, as I assume the
tests were fine and it works for you?

Thanks,
Olek
On Fri, Dec 6, 2024, at 7:06 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 5 Dec 2024 17:41:27 -0700
>
> [...]
>
>> Once I fixed up the prog to reset the metadata pointer, I could see the
>> wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
>> No latency changes.
>>
>> Sorry about the churn.
>
> No problem, crap happens sometimes :)
>
> Let me send my implementation on Monday-Wednesday. I'll include my UDP
> and TCP test results, as well as yours (+18%).
>
> BTW would be great if you could give me a Tested-by tag, as I assume the
> tests were fine and it works for you?

Yep, worked great for me.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>