Message ID | cover.1726480607.git.lorenzo@kernel.org (mailing list archive)
---|---
Series | Introduce GRO support to cpumap codebase
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon, 16 Sep 2024 12:13:42 +0200

> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> Changes in rfc v2:
> - get rid of dummy netdev dependency
>
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support

Oh okay, so it still uses NAPI. When I'm back from the conferences (next
week), I might rebase and send the solution where I only use the GRO part
of it, i.e. no napi_schedule()/poll()/napi_complete() logic.

>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)

Thanks,
Olek
Hi Lorenzo,

On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
>
> [...]
>
> --
> 2.46.0

Sorry about the long delay - finally caught up to everything after
conferences.

I re-ran my synthetic tests (including baseline). v2 is somehow showing
2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
variable I changed is kernel version - steering prog is active for both.

Baseline (again)

./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30
./tcp_stream -c -H $TASK_IP -T8 -F16 -l30

          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)            Throughput (Mbit/s)
Run 1     2560252       0.00009087       0.00010495       0.00011647       Run 1     15479.31
Run 2     2665517       0.00008575       0.00010239       0.00013311       Run 2     15162.48
Run 3     2755939       0.00008191       0.00010367       0.00012287       Run 3     14709.04
Run 4     2595680       0.00008575       0.00011263       0.00012671       Run 4     15373.06
Run 5     2841865       0.00007999       0.00009471       0.00012799       Run 5     15234.91
Average   2683850.6     0.000084854      0.00010367       0.00012543       Average   15191.76

cpumap NAPI patches v2

          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)            Throughput (Mbit/s)
Run 1     2577838       0.00008575       0.00012031       0.00013695       Run 1     19914.56
Run 2     2729237       0.00007551       0.00013311       0.00017663       Run 2     20140.92
Run 3     2689442       0.00008319       0.00010495       0.00013311       Run 3     19887.48
Run 4     2862366       0.00008127       0.00009471       0.00010623       Run 4     19374.49
Run 5     2700538       0.00008319       0.00010367       0.00012799       Run 5     19784.49
Average   2711884.2     0.000081782      0.00011135       0.000136182      Average   19820.388
Delta     1.04%         -3.62%           7.41%            8.57%                      30.47%

Thanks,
Daniel
> Hi Lorenzo,
>
> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
>
> [...]
>
> Sorry about the long delay - finally caught up to everything after
> conferences.
>
> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> variable I changed is kernel version - steering prog is active for both.
>
> [...]
>
> Thanks,
> Daniel

Hi Daniel,

cool, thx for testing it.

@Olek: how do we want to proceed on it? Are you still working on it or do
you want me to send a regular patch for it?

Regards,
Lorenzo
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 12:46:00 +0200

> [...]
>
> @Olek: how do we want to proceed on it? Are you still working on it or do
> you want me to send a regular patch for it?

Hi,

I had a small vacation, sorry. I'm starting to work on it again today.

> Regards,
> Lorenzo

Thanks,
Olek
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 12:46:00 +0200
>
> [...]
>
> I had a small vacation, sorry. I'm starting to work on it again today.

ack, no worries. Are you going to rebase the other patches on top of it
or are you going to try a different approach?

Regards,
Lorenzo
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 14:47:58 +0200

> [...]
>
> ack, no worries. Are you going to rebase the other patches on top of it
> or are you going to try a different approach?

I'll try the approach without NAPI as Kuba asks and let Daniel test it,
then we'll see.

BTW I'm curious how he got this boost on v2, from what I see you didn't
change the implementation that much?

Thanks,
Olek
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 9 Oct 2024 14:50:42 +0200

> [...]
>
> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> then we'll see.

For now, I have the same results without NAPI as with your series, so
I'll push it soon and let Daniel test.

(I simply decoupled GRO and NAPI and used the former in cpumap, but the
kthread logic didn't change)

> BTW I'm curious how he got this boost on v2, from what I see you didn't
> change the implementation that much?

Thanks,
Olek
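A minimal sketch of what this decoupling could look like inside the cpumap kthread, assuming a bare per-entry GRO context (the `rcpu->gro` field is hypothetical, and `gro_receive_skb()`/`gro_flush()` follow the naming used later in this thread; the exact signatures are whatever the WIP branch defines):

```c
/* Sketch: feed skbs built by the cpumap kthread straight into GRO and
 * flush before sleeping, with no napi_schedule()/poll()/napi_complete()
 * round trip. rcpu->gro is an assumed per-entry GRO context.
 */
static void cpu_map_gro_receive(struct bpf_cpu_map_entry *rcpu,
				struct list_head *skbs)
{
	struct sk_buff *skb, *tmp;

	list_for_each_entry_safe(skb, tmp, skbs, list) {
		skb_list_del_init(skb);
		gro_receive_skb(&rcpu->gro, skb);	/* aggregate */
	}

	/* The kthread flushes whatever GRO still holds before it goes
	 * back to sleep, since no NAPI completion will do it.
	 */
	gro_flush(&rcpu->gro, false);
}
```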
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 22 Oct 2024 17:51:43 +0200

> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 9 Oct 2024 14:50:42 +0200
>
> [...]
>
> For now, I have the same results without NAPI as with your series, so
> I'll push it soon and let Daniel test.
>
> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> kthread logic didn't change)

Hi Daniel,

Sorry for the delay. Please test [0].

[0] https://github.com/alobakin/linux/commits/cpumap-old

Thanks,
Olek
On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Tue, 22 Oct 2024 17:51:43 +0200
>
> [...]
>
> Hi Daniel,
>
> Sorry for the delay. Please test [0].
>
> [0] https://github.com/alobakin/linux/commits/cpumap-old
>
> Thanks,
> Olek

Ack. Will do probably early next week.
Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Sorry for the delay. Please test [0].
>>
>> [0] https://github.com/alobakin/linux/commits/cpumap-old
>
> Ack. Will do probably early next week.

Baseline (again)

          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)            Throughput (Mbit/s)
Run 1     3169917       0.00007295       0.00007871       0.00009343       Run 1     21749.43
Run 2     3228290       0.00007103       0.00007679       0.00009215       Run 2     21897.17
Run 3     3226746       0.00007231       0.00007871       0.00009087       Run 3     21906.82
Run 4     3191258       0.00007231       0.00007743       0.00009087       Run 4     21155.15
Run 5     3235653       0.00007231       0.00007743       0.00008703       Run 5     21397.06
Average   3210372.8     0.000072182      0.000077814      0.00009087       Average   21621.126

cpumap v2 Olek

          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)            Throughput (Mbit/s)
Run 1     3253651       0.00007167       0.00007807       0.00009343       Run 1     13497.57
Run 2     3221492       0.00007231       0.00007743       0.00009087       Run 2     12115.53
Run 3     3296453       0.00007039       0.00007807       0.00009087       Run 3     12323.38
Run 4     3254460       0.00007167       0.00007807       0.00009087       Run 4     12901.88
Run 5     3173327       0.00007295       0.00007871       0.00009215       Run 5     12593.22
Average   3239876.6     0.000071798      0.00007807       0.000091638      Average   12686.316
Delta     0.92%         -0.53%           0.33%            0.85%                      -41.32%

It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's some
output from some profiles I took with:

perf record -e cycles:k -a -- sleep 10
perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
# Event 'cycles:k'
# Baseline  Delta Abs  Shared Object                       Symbol
     6.13%     -3.60%  [kernel.kallsyms]                   [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency   [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering  [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                   [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                   [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                   [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                   [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                   [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                   [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                   [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                   [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                   [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                   [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                   [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                   [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                   [k] skb_release_data
               +0.81%  [kernel.kallsyms]                   [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                   [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                   [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                   [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                   [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                   [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                   [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                   [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                   [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                   [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                   [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                   [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                   [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                   [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                           [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                   [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                   [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                   [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                   [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                   [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                   [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                   [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                   [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                   [k] tcp_urg
               +0.39%  [kernel.kallsyms]                   [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                   [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                   [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                   [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                   [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                   [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                   [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                   [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                            [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                   [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                   [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                   [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                   [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                   [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                   [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                   [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                   [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                   [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                   [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                   [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                   [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                   [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                   [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                   [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                   [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                   [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                   [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                   [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                   [k] tcp_add_backlog
<snip>
     5.68%             bpf_prog_17fea1bb6503ed98_steering  [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                   [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                   [k] __memset
     0.54%             [kernel.kallsyms]                   [k] __memcpy
     0.18%             [kernel.kallsyms]                   [k] __irqentry_text_start
<snip>

Please let me know if you want me to collect any other data.

Thanks,
Daniel
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 22 Nov 2024 17:10:06 -0700

> Hi Olek,
>
> Here are the results.
>
> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:

[...]

> Baseline (again)
>
> [...]
>
> cpumap v2 Olek
>
> [...]
> Delta     0.92%    -0.53%    0.33%    0.85%    -41.32%
>
> It's very interesting that we see -40% tput w/ the patches. I went back

Oh no, I messed up something =\

Could you please also test not the whole series, but patches 1-3 (up to
"bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
array...")? Would be great to see whether this implementation works
worse right from the start or I just broke something later on.

> and double checked and it seems the numbers are right. Here's some
> output from some profiles I took with:
>
> perf record -e cycles:k -a -- sleep 10
> perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>
> # Event 'cycles:k'
> # Baseline  Delta Abs  Shared Object      Symbol
>      6.13%     -3.60%  [kernel.kallsyms]  [k] _copy_to_iter

BTW, what CONFIG_HZ do you have on the kernel you're testing with?

Thanks,
Olek
On Mon, Nov 25, 2024 at 04:12:24PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
> [...]
>
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Will do.

> [...]
>
> BTW, what CONFIG_HZ do you have on the kernel you're testing with?

# zgrep CONFIG_HZ /proc/config.gz
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

Just curious - why do you ask?

Thanks,
Daniel
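For context on why CONFIG_HZ can matter here: GRO's partial flush is jiffies-based, so packets may sit in the GRO lists for roughly one tick - about 1 ms at HZ=1000 but 4 ms at HZ=250 - which changes how much aggregation a kthread-driven flush can see. A simplified paraphrase of the flush check (based on __napi_gro_flush_chain() in net/core/gro.c; exact field layout varies by kernel version):

```c
/* Simplified from __napi_gro_flush_chain() in net/core/gro.c: on a
 * partial flush, skbs whose GRO age equals the current jiffy are kept
 * queued for further merging, so HZ bounds the aggregation window.
 */
static void gro_flush_chain(struct napi_struct *napi, u32 index,
			    bool flush_old)
{
	struct list_head *head = &napi->gro_hash[index].list;
	struct sk_buff *skb, *p;

	list_for_each_entry_safe_reverse(skb, p, head, list) {
		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
			return;			/* keep this-tick skbs */
		skb_list_del_init(skb);
		napi_gro_complete(napi, skb);	/* deliver merged skb */
		napi->gro_hash[index].count--;
	}
}
```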
On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
> [...]
>
>> Baseline (again)
>>
>> [...]

We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results. Especially controlling
which CPU cores our code paths are executing on.

In the above "baseline" case, we have two processes/tasks executing:
 (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
 (2) Userspace netserver process TCP receiving data from socket.

My experience is that you will see two noticeably different throughput
performance results depending on whether (1) and (2) are executing on
the *same* CPU (multi-tasking context-switching), or executing in
parallel (e.g. pinned) on two different CPU cores.

The netperf command has an option:

 -T lcpu,remcpu
    Request that netperf be bound to local CPU lcpu and/or netserver be
    bound to remote CPU rcpu.

Verify the setting by listing pinning like this:

 for PID in $(pidof netserver); do taskset -pc $PID ; done

You can also set pinning at runtime like this:

 export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done

For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust pinning at runtime to observe the effect quickly.

My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as it depends on task
timing, which can get affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:

 sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
as it requires holding the file open. E.g. I play with these profiles:

 sudo tuned-adm profile throughput-performance
 sudo tuned-adm profile latency-performance
 sudo tuned-adm profile network-latency

>> cpumap v2 Olek
>>
>> [...]
>> Delta     0.92%    -0.53%    0.33%    0.85%    -41.32%

We now have three processes/tasks executing:
 (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
 (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
 (3) Userspace netserver process TCP receiving data from socket.

Again, the performance is going to depend on which CPU cores the
processes/tasks are running on and whether some are sharing the same
CPU. (There are both wakeup timing and cache-line effects).

There are now more combinations to test...

CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.

>> It's very interesting that we see -40% tput w/ the patches. I went back

Sad that we see -40% throughput... but do we know what CPU cores the
now three different tasks/processes run on(?)

> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
>
> [...]
>
>> # Event 'cycles:k'
>> # Baseline  Delta Abs  Shared Object      Symbol
>>      6.13%     -3.60%  [kernel.kallsyms]  [k] _copy_to_iter

I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.

Fortunately perf diff (and perf report) support doing this:

 perf diff --sort=cpu,symbol

But then you also need to control the CPUs used in the experiment for
the diff to work.

I hope I made sense as these kind of CPU scaling benchmarks are tricky,
--Jesper
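A side note on the task placement Jesper describes: of the three tasks, the cpumap kthread's CPU is fixed when the map entry is created. Paraphrased from __cpu_map_entry_alloc() in kernel/bpf/cpumap.c (details differ across kernel versions, error handling elided):

```c
/* The cpumap kthread is created on the entry's NUMA node and bound to
 * the selected CPU, so only the RX softirq placement is left to
 * RSS/IRQ affinity in the setups discussed above.
 */
rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
				       "cpumap/%d/map:%d", cpu, map_id);
if (IS_ERR(rcpu->kthread))
	goto free_rcpu;			/* error path elided */

kthread_bind(rcpu->kthread, cpu);
wake_up_process(rcpu->kthread);
```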
Hi Jesper,

On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> On 25/11/2024 16.12, Alexander Lobakin wrote:
>
> [...]
>
> We need to talk about what we are measuring, and how to control the
> experiment setup to get reproducible results. Especially controlling
> which CPU cores our code paths are executing on.
>
> In the above "baseline" case, we have two processes/tasks executing:
>  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
>  (2) Userspace netserver process TCP receiving data from socket.

"baseline" in this case is still cpumap, just without these GRO patches.

> [...]
>
> For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> as it requires holding the file open. E.g. I play with these profiles:
>
>  sudo tuned-adm profile throughput-performance
>  sudo tuned-adm profile latency-performance
>  sudo tuned-adm profile network-latency

Appreciate the tips - I should keep this saved somewhere.

> We now have three processes/tasks executing:
>  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
>  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
>  (3) Userspace netserver process TCP receiving data from socket.
>
> [...]
>
> Sad that we see -40% throughput... but do we know what CPU cores the
> now three different tasks/processes run on(?)

Roughly, yes. For context, my primary use case for cpumap is to provide
some degree of isolation between colocated containers on a single host.
In particular, colocation occurs on AMD Bergamo. And containers are
CPU pinned to their own CCX (roughly). My RX steering program ensures
RX packets destined to a specific container are cpumap redirected to any
of the container's pinned CPUs. It not only provides a good measure of
isolation but ensures resources are properly accounted.

So to answer your question of which CPUs the 3 things run on: cpumap
kthread and application run on the same set of cores. More than that,
they share the same L3 cache by design. irq/softirq is effectively
random given default RSS config and IRQ affinities.

> I really appreciate that you provide perf data and perf diff, but as
> described above, we need data and information on what CPU cores are
> running which workload.
>
> [...]
>
> I hope I made sense as these kind of CPU scaling benchmarks are tricky,

Indeed, sounds quite tricky.

My understanding with GRO is that it's a powerful general purpose
optimization. Enough that it should rise above the usual noise on a
reasonably configured system (which mine is).

Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations? So in Olek's above series, patches 1-3 seem like
they would still benefit from a simpler testbed. But the more targeted
optimizations in patch 4+ would probably justify a de-noised setup.
Possibly single host with xdp-trafficgen or something.

Procedurally speaking, maybe it would save some wasted effort if
everyone agreed on the general approach before investing more time into
finer optimizations built on top of the basic GRO support?

Thanks,
Daniel
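For readers unfamiliar with the setup Daniel describes, a hypothetical minimal version of such an RX steering program might look like the following (the container lookup and the 8-CPU CCX fan-out are illustrative placeholders, not his actual logic):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* cpumap keyed by CPU id; the value sets the per-CPU queue size (and
 * optionally a program to run on the remote CPU).
 */
struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 128);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int rx_steer(struct xdp_md *ctx)
{
	/* A real steering program would parse headers and map the flow
	 * to its destination container's pinned CPU set; fanning out by
	 * RX queue index stands in for that lookup here.
	 */
	__u32 cpu = 8 + (ctx->rx_queue_index & 7);	/* one 8-CPU CCX */

	/* Redirect to the chosen CPU; fall back to XDP_PASS on lookup
	 * failure (the fallback action rides in the flags argument). */
	return bpf_redirect_map(&cpu_map, cpu, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```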
> Hi Jesper,
>
> On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
>
> [...]
>
> My understanding with GRO is that it's a powerful general purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
>
> Maybe we can consider decoupling the cpumap GRO enablement from the
> later optimizations?

I agree. First, we need to identify the best approach to enable GRO on
cpumap (between Olek's approach and what I have suggested) and then we
can evaluate subsequent optimizations.

@Olek: do you agree?

Regards,
Lorenzo

> So in Olek's above series, patches 1-3 seem like they would still
> benefit from a simpler testbed. But the more targeted optimizations in
> patch 4+ would probably justify a de-noised setup. Possibly single host
> with xdp-trafficgen or something.
>
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
>
> Thanks,
> Daniel
On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
> [...]
>
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Patches 1-3 reproduce the -40% tput numbers.

With patches 1-4 the numbers get slightly worse (~1gbps lower) but it
was noisy. tcp_rr results were unaffected.

Thanks,
Daniel